persistence.cpp / FileStorage produces and assumes invalid XML (Bug #976)

Added by Linus Atorf almost 14 years ago. Updated almost 14 years ago.

Status:	Done	Start date:
Priority:	High	Due date:
Assignee:	Vadim Pisarevsky	% Done:	0%
Category:	core
Target version:	-
Affected version:		Operating System:
Difficulty:		HW Platform:
Pull request:

Description

The persistence / FileStorage subsystem produces invalid XML files, and assumes a particular kind of invalid file during loading.

Problem during XML writing:
The generated file has the header:

<?xml version="1.0"?>

So no encoding is specified. The XML standard says:

Note that the encoding of an XML document is never iso-8859-1 by default.

See http://www.opentag.com/xfaq_enc.htm#enc_default
However, the file is NOT saved as UTF-8. It's plain "ASCII / ANSI" standard (use e.g. Notepad++ to find out).

The problems during reading:
Only files produced by the incorrect method above are considered valid, or files that contain the encoding tag "ASCII". The problem is, again: if no encoding tag is present, the standard assumes UTF-8. This is NOT what OpenCV does. OpenCV will throw this exception if it encounters the byte order mark (BOM):

CV_PARSE_ERROR( "Invalid character in the stream" );

See [source:trunk/opencv/modules/core/src/persistence.cpp#L1715 persistence.cpp, line 1715]. For details, again, see http://www.opentag.com/xfaq_enc.htm#enc_default

If the encoding is UTF-8 without the BOM, and the encoding tag is missing, too, then according to XML standard, UTF-8 is assumed. OpenCV loads the file, which is nice, BUT: Since OpenCV apparently only supports ASCII, this is again wrong (it should throw an unsupported format error). Maybe a wrong encoding is used during parsing, which is a bad thing.
I'm talking about these lines

if( encoding && strcmp( encoding, "ASCII" ) != 0 )
    CV_PARSE_ERROR( "Unsupported encoding" );

See [source:trunk/opencv/modules/core/src/persistence.cpp#L2147 persistence.cpp, lines 2147f].

To summarize:

OpenCV has to produce valid XML files! An encoding tag has to be supplied to the written file, or the encoding must be UTF-8.
OpenCV must be careful when reading encoding tags and files. Silently accepting its own invalid files is not an option.

OpenCV-XML-Samples.zip - XML files for the examples I mentioned (8 kB) Linus Atorf, 2011-05-23 01:26 am

Associated revisions

Revision 60a0ebbd
Added by Vadim Pisarevsky almost 14 years ago

added optional encoding parameter to cvOpenFileStorage() and FileStorage::open() (ticket #976). moved some implementation parts of CommandLineParser to cmdparser.cpp.

Revision 83fa4d38
Added by Roman Donchenko over 11 years ago

Merge pull request #976 from PeterMinin:num_detections

History

#1
Updated by Vadim Pisarevsky almost 14 years ago

thanks! In SVN trunk, r5157, the problem has been fixed. You cannot use UTF-8 characters in variable names, but text strings are written and read fine (via cvWriteString()/cvReadString() or the newer C++ stream-like syntax).

Status changed from Open to Done
(deleted custom field) set to fixed

#2
Updated by Linus Atorf almost 14 years ago

Thanks for looking at this. I just applied the patch from r5157, recompiled my app, and tested it by using my method to create an XML file. It seems there was a misunderstanding. I didn't have problems "writing utf-8 characters" via strings. The problem I had, and I still have, is that the whole file created has a wrong encoding.

The problem is, the encoding of the file is ASCII. See screenshot attached and/or try it yourself.

For this, the header of the XML file has to look like this:

<?xml version="1.0" encoding="ASCII"?>

However, OpenCV still creates a file that looks like this:

<?xml version="1.0"?>

So no encoding tag is present. The XML standard says:

If no encoding declaration is present in the XML document (and no external encoding
declaration mechanism such as the HTTP header is available), the assumed encoding of
an XML document depends on the presence of the Byte-Order-Mark (BOM).

We don't have a Byte-Order-Mark. If we put one in, OpenCV crashes during reading with

CV_PARSE_ERROR( "Invalid character in the stream" );

To quote again from http://www.opentag.com/xfaq_enc.htm#enc_default:

First bytes          Encoding assumed
EF BB BF             UTF-8
FE FF                UTF-16 (big-endian)
FF FE                UTF-16 (little-endian)
00 00 FE FF          UTF-32 (big-endian)
FF FE 00 00          UTF-32 (little-endian)
None of the above    UTF-8

This is the problem. We have "none of the above", so UTF-8 is assumed. But the actual encoding is not UTF-8, which makes the file invalid, and it causes other applications that try to read this file to crash.
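The lookup described by the table can be sketched in a few lines. This is only an illustration of the rule quoted above, not code from persistence.cpp; the function name detectEncodingFromBom and the returned labels are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of OpenCV): guess the encoding of an XML
// document from its first bytes, following the table quoted above.
std::string detectEncodingFromBom(const std::string& bytes)
{
    auto startsWith = [&bytes](const char* prefix, std::size_t n) {
        return bytes.size() >= n && bytes.compare(0, n, prefix, n) == 0;
    };

    // The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks,
    // because FF FE 00 00 also begins with FF FE.
    if (startsWith("\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
    if (startsWith("\xFF\xFE\x00\x00", 4)) return "UTF-32LE";
    if (startsWith("\xEF\xBB\xBF", 3))     return "UTF-8";
    if (startsWith("\xFE\xFF", 2))         return "UTF-16BE";
    if (startsWith("\xFF\xFE", 2))         return "UTF-16LE";

    // No BOM (and, by assumption, no encoding declaration): UTF-8 is assumed.
    return "UTF-8";
}
```

A reader built this way would skip the three BOM bytes of a UTF-8 file instead of rejecting them as an invalid character.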

There are 3 ways to fix this:

1. Let OpenCV just put the encoding="ASCII" tag into the top root XML tag.
2. Leave the top root XML tag, but encode everything as UTF-8 before writing it to file.
3. Do something else, with Byte-Order-Mark, but keep it to the standard.

I'd supply the patch myself, but I'm not too familiar with the code in persistence.cpp, sorry.

Thanks very much for taking care!
Linus Atorf

Status changed from Done to Cancelled
(deleted custom field) deleted (~~fixed~~)

#3
Updated by Vadim Pisarevsky almost 14 years ago

ok, I put explicit encoding="UTF-8" into the XML header (persistence.cpp, r5160).

the following sample creates XML that is then normally opened within Safari, Notepad++ and a few other programs:

int main(int, char**) {
FileStorage fs("_a.xml", CV_STORAGE_WRITE);
cvWriteComment(fs.fs, "комментарий по-русски", false);
fs << "a" << string("текст");
fs << "b" << string("еще немного текста");
fs.release();
fs.open("_a.xml", CV_STORAGE_READ);
string s = fs["a"];
string s1 = fs["b"];
cout << "read a: " << s << " and b: " << s1 << endl;
return 0;
}

if something still does not work on your side, can you supply some similar example so that I can debug it?

Status changed from Cancelled to Done
(deleted custom field) set to fixed

#4
Updated by Vadim Pisarevsky almost 14 years ago

oops, the formatting is broken. One more try:


#include "opencv2/opencv.hpp" 
#include <string>
#include <iostream>

using namespace cv;
using namespace std;

int main(int, char**)
{
    FileStorage fs("_a.xml", CV_STORAGE_WRITE);
    cvWriteComment(fs.fs, "комментарий по-русски", false);
    fs << "a" << string("текст");
    fs << "b" << string("еще немного текста");
    fs.release();
    fs.open("_a.xml", CV_STORAGE_READ);
    string s = fs["a"];
    string s1 = fs["b"];
    cout << "read a: " << s << " and b: " << s1 << endl;
    return 0;
}

#5
Updated by Vadim Pisarevsky almost 14 years ago

of course, the sample will work if the source code has UTF-8 encoding, or if you converted the strings and comments to UTF-8 before passing them to OpenCV.

#6
Updated by Linus Atorf almost 14 years ago

Hi, I'm sorry I have to reopen this, I'm still not satisfied. First of all, I don't want to write any "exotic" strings to XML files, so I wanted to do standard tests to be on the safe side. I applied the patch for r5160, recompiled, and modified your example to this (just the strings changed):

#include "opencv2/opencv.hpp" 
#include <string>
#include <iostream>

using namespace cv;
using namespace std;

int _tmain(int argc, _TCHAR* argv[]) 
{
    FileStorage fs("_a.xml", CV_STORAGE_WRITE);
    cvWriteComment(fs.fs, "aaaaa", false);
    fs << "a" << string("bbbb");
    fs << "b" << string("ccccc");
    fs.release();
    fs.open("_a.xml", CV_STORAGE_READ);
    string s = fs["a"];
    string s1 = fs["b"];
    cout << "read a: " << s << " and b: " << s1 << endl;
    return 0;
}

I can compile this, but I can't run it. I get:

OpenCV Error: Unspecified error (Incorrect element name ╠╠╠╠a) in unknown function, file ..\..\..\modules\core\src\persistence.cpp, line 5049

I made sure my main.cpp source file is plain ANSI encoding.

When I comment out the "fs << " lines, and just leave the cvWriteComment, the program runs fine. However, the file created has the file name: ÌÌÌÌ_a.xml. I don't know what's going on there...

Interestingly, my old code inside my big project runs fine (although I use Qt strings all the time and just pass them to the << operator with QString::toStdString()).

----
Anyway, this all let aside and focusing on your last patch: I believe you didn't fix it, or even made it worse.

Before we had the situation:

No encoding tag; the actual encoding was ANSI. I submitted a screenshot.

Now we have:

The encoding tag says "UTF-8"; the actual encoding is still ANSI.

In a previous post, I said as one of the options:

Let OpenCV just put the encoding="ASCII" tag into the top root XML tag.

But instead, you put encoding="UTF-8" in, without changing the underlying encoding.

Because many applications are "conservative in what they do, and liberal in what they accept", this wrong encoding may go unnoticed until somebody actually uses characters from the extended charset. I myself can't remember where exactly my toolchain crashed before, but I was using MATLAB and Java to parse the XML files created by OpenCV when I noticed the problem.

I'm just saying, it can be a huge annoyance if the encoding is wrong. So the straightforward way right now would be to make sure that files produced by OpenCV have this header:

<?xml version="1.0" encoding="ASCII"?>

What happens when people start putting real UTF-8 strings in has to be tested / thought of / decided. Right now, that is none of my concern -- my problem is more basic: If we produce an XML file, it should conform to the standard...

I included a set of 5 test-files. You can easily use Notepad++ to check the encoding.

1. The original version of OpenCV when I created this ticket started out with file "01_tagNONE_encodingANSI_invalid.xml".
2. Your latest patch applied produces "02_tagUTF8_encodingANSI_invalid.xml". Both are invalid.
3. The easiest alternative would be to work with files like "03_tagASCII_encodingANSI_valid.xml".
4. In the long run, OpenCV should support reading files such as "04_tagNONE_encodingUTF8_BOM_valid.xml" and
5. "05_tagNONE_encodingUTF8_noBOM_valid.xml", but that is not important to me personally right now.

Please note that OpenCV throws this exception when opening file "04_tagNONE_encodingUTF8_BOM_valid.xml":

..\..\..\modules\core\src\persistence.cpp:2141: error: (-212) .\04_tagNONE_encodingUTF8_BOM_valid.xml(1): Valid XML should start with _

I guess this is the persistence module encountering the binary Byte-Order-Marker.

If you could just change the output encoding tag to ASCII, we'd be in a much better place. A nice introductory tutorial about encodings etc. is this one by Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.

Cheers! I zipped the attached XML sample files so their encoding doesn't get changed by Trac.

Linus Atorf

Status changed from Done to Cancelled
(deleted custom field) deleted (~~fixed~~)

#7
Updated by Linus Atorf almost 14 years ago

Ok, I just noticed one thing, while looking at the encodings in Notepad++ again: If you use UTF-8 encoding on a file that doesn't contain any extended characters, and you leave out the BOM, the encodings UTF-8 and ANSI are indistinguishable. So you convert something with Notepad++ to UTF-8, reload it, and it still shows up as plain ANSI.

Still, if somebody puts in some special character, then suddenly it all makes a difference (it doesn't have to be some fancy Asian Unicode stuff; I think a French letter like é or a § sign would be enough).

#8
Updated by Linus Atorf almost 14 years ago

And another thing: Since UTF-8 encoding is assumed when there is no encoding-tag anyway, putting "UTF-8" or no tag at all should make absolutely no difference.

#9
Updated by Vadim Pisarevsky almost 14 years ago

ASCII/ANSI is a subset of UTF-8, so any UTF-8 document containing only characters with codes 32-127 is also valid ASCII/ANSI document.
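To make the subset relationship concrete, here is a minimal sketch of two byte-level checks. The helper names isAscii and isValidUtf8 are hypothetical and nothing like them is claimed to exist in OpenCV. Note that strict ASCII covers bytes 0 to 127 only; a single-byte "ANSI" string that uses bytes of 128 and above is generally not well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if every byte is in the 7-bit ASCII range (0x00-0x7F).
bool isAscii(const std::string& s)
{
    for (unsigned char c : s)
        if (c > 0x7F) return false;
    return true;
}

// True if the bytes form well-formed UTF-8. Overlong forms, surrogates
// and code points above U+10FFFF are rejected.
bool isValidUtf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;    // continuation bytes expected after the lead byte
        unsigned long cp;     // code point being decoded
        unsigned long minCp;  // smallest code point allowed for this length
        if (c <= 0x7F)               { extra = 0; cp = c;        minCp = 0; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; minCp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; minCp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; minCp = 0x10000; }
        else return false;    // stray continuation byte or invalid lead byte

        if (n - i - 1 < extra) return false;  // sequence truncated at end of input
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}
```

With these checks, "bbbb" passes both tests, while the ISO-8859-1 bytes for "ääöö" (E4 E4 F6 F6) fail the UTF-8 test. That matches the situation discussed in this ticket: the files only look correct as long as no extended characters are written.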

From your notes about the old code running fine and some strange garbage symbols appearing in the printed text strings and the file name I conclude that there is something wrong in your code, or in the file containing the source code. I suggest you check the source code with a hex editor and make sure it's not a UTF-16 file and it does not contain any strange characters. If the code producing the attached XMLs is fine, can you supply the code as well?

#10
Updated by Linus Atorf almost 14 years ago

Replying to [comment:9 vp153]:

From your notes about the old code running fine and some strange garbage symbols appearing in the printed text strings and the file name I conclude that there is something wrong in your code, or in the file containing the source code. I suggest you check the source code with a hex editor and make sure it's not a UTF-16 file and it does not contain any strange characters.

I did that before (even checked the Visual Studio solution with an editor). Anyway, I put everything into a clean new project, and now your example works correctly with the latest version.

#include "opencv2/opencv.hpp" 
#include <string>
#include <iostream>

using namespace cv;
using namespace std;

int main()
{
    FileStorage fs("_a.xml", CV_STORAGE_WRITE);
    cvWriteComment(fs.fs, "aaaaa", false);
    //fs << "a" << string("bbbbääöö");
    fs << "a" << string("bbbbb");
    fs << "b" << string("ccccc");
    fs.release();
    fs.open("_a.xml", CV_STORAGE_READ);
    string s = fs["a"];
    string s1 = fs["b"];
    cout << "read a: " << s << " and b: " << s1 << endl;
    return 0;
}

However, if I uncomment the line with the extended ASCII characters (and use it instead of the plain one), I get this error:

        _ASSERTE((unsigned)(c + 1) <= 256);

in file istype.c

It's caused by OpenCV here, persistence.cpp, line 1887:

                    if( !isalnum(c) )

Best regards, Linus

#11
Updated by Vadim Pisarevsky almost 14 years ago

ok, in the latest persistence.cpp I replaced all the is<> with custom functions that should not throw exceptions. Hopefully, it will now work well.
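Some background on why the assertion in comment #10 fires: the C library requires the argument of isalnum() to be representable as unsigned char (or to equal EOF). Where char is signed, a byte such as 0xE4 ("ä" in ISO-8859-1) becomes a negative value, which is outside that range. The following is a minimal sketch of a locale-independent replacement. The name isAlnumAscii is hypothetical, and this is not the code that was committed to persistence.cpp.

```cpp
#include <cassert>

// Hypothetical stand-in for isalnum() that is safe for any char value.
// It uses plain range comparisons, so a negative char (a byte >= 0x80 on
// platforms where char is signed) simply yields false.
inline bool isAlnumAscii(char c)
{
    return (c >= '0' && c <= '9') ||
           (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z');
}
```

An alternative that keeps the standard function is to cast first, as in isalnum(static_cast<unsigned char>(c)), although the result then depends on the current C locale.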

#12
Updated by Linus Atorf almost 14 years ago

Thanks again for your fast fix. Indeed, now the example code runs fine, even with special characters! However, Notepad++ shows the produced XML file like this:

<?xml version="1.0" encoding="UTF-8"?>
<opencv_storage>
<!-- aaaaa -->
<a>"bbbb奶��a>
<b>ccccc</b>
</opencv_storage>

(In my browser, the closing tag of </a> is damaged, but that seems to be a Trac/browser problem -- it shows up ok in Notepad++).

In the "Encoding" menu, it shows "UTF-8 (without BOM)". It probably detected this because of the UTF-8 encoding tag. When I switch to ANSI, it all looks ok.

It also looks ok when I manually edit the encoding tag to "ASCII". Apparently, Notepad++ then chooses the right encoding:

<?xml version="1.0" encoding="ASCII"?>
<opencv_storage>
<!-- aaaaa -->
<a>"bbbbääöö"</a>
<b>ccccc</b>
</opencv_storage>

(It's great to see Trac coping with all of this :-) ).

Anyway, this was my original point: There are two different things in an XML file:

What encoding is specified, i.e. what is determined by the encoding tag, by the BOM, or the default option.
The actual underlying character encoding of the string data.

Until now, OpenCV always puts ASCII/ANSI. That's why I suggested "ok, let's put the ASCII tag into the XML file".

The good thing for me personally is, as I realized and you pointed out: If you stick to "standard" US 32 - 127 chars, UTF-8 and ASCII/ANSI are indistinguishable...

In the long run, if people put extended chars into XML files via the persistence module, this will cause problems, unless:

OpenCV encodes the actual file content to UTF-8
or the encoding tag says ASCII

I'm mentioning this, because in European languages extended chars ( 128 - 255) are very common, like é, ö, etc.

#13
Updated by Vadim Pisarevsky almost 14 years ago

UTF-8 is the encoding we should use; the alternatives are inferior. It is the developer who must make sure that the text strings being written are UTF-8 encoded. OpenCV takes "const char*" or "std::string", so it knows nothing about the encoding in use. Handling different encodings is outside the scope of a computer vision library, so we just assume UTF-8. On Linux, MacOSX and virtually all other Unix systems UTF-8 is the default encoding, so everything should work out of the box. On Windows, software sometimes uses UTF-16, iso-8859-1, or another single-byte "local" encoding. In that case the strings should be converted to UTF-8 before they are written. I will add this note to the reference manual.

I suppose, the ticket can be closed, right?
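For the single-byte case mentioned above, the conversion is small enough to sketch. This assumes the input really is ISO-8859-1, where every byte maps directly to the Unicode code point of the same value. Windows "ANSI" code pages such as 1252 differ in the range 0x80 to 0x9F and would need a lookup table instead. The function name latin1ToUtf8 is hypothetical and not part of OpenCV.

```cpp
#include <cassert>
#include <string>

// Convert an ISO-8859-1 byte string to UTF-8.
// Bytes below 0x80 are copied unchanged; every other byte becomes a
// two-byte UTF-8 sequence for the code point with the same numeric value.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string utf8;
    utf8.reserve(latin1.size() * 2);
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            utf8 += static_cast<char>(c);
        } else {
            utf8 += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            utf8 += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
        }
    }
    return utf8;
}
```

A caller would then pass the converted string to the stream operator, for example fs << "a" << latin1ToUtf8(text), so that the bytes on disk agree with a UTF-8 reading of the file.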

#14
Updated by Linus Atorf almost 14 years ago

Oh ok, I get it now. As long as it all works when you pass an actual UTF-8 string, that's fine. (And as long as the documentation says: Use UTF-8 with persistence!).

I understand your point about OpenCV, and I agree. I won't do any further tests, as I'm not using UTF-8 anyway. Right now I'm satisfied, thanks for the heads up and discussion.

Can be closed, thanks again

#15
Updated by Vadim Pisarevsky almost 14 years ago

ok, thank you too! :)
one more question before closing the ticket. If no encoding is specified


<?xml version="1.0"?>
<opencv_storage>
<!-- aaaaa -->
<a>"bbbbääöö"</a>
<b>ccccc</b>
</opencv_storage>

is the file displayed correctly in Notepad++?
I just checked and it works perfectly well on mac. If it improves situation on Windows, I can remove encoding specification from XML.

Regards,
Vadim

#16
Updated by Vadim Pisarevsky almost 14 years ago

some more changes. hopefully, the latest solution will satisfy everybody. In r5253 the optional encoding parameter has been added to cvOpenFileStorage() and FileStorage::open(). By default, no encoding is written in the output file, but the user can provide some, e.g.

cvOpenFileStorage("mydata.xml", CV_STORAGE_WRITE, "iso-8859-1") will create XML with the header

<?xml version="1.0" encoding="iso-8859-1"?>

encoding check in the parser has been removed.
UTF-16 XMLs are still not supported, since the feature would require a rewrite of the XML engine in OpenCV.
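The header behaviour described in this comment can be illustrated with a short sketch. It only mirrors the description (no encoding attribute by default, a user-supplied one otherwise) and is not the implementation in persistence.cpp; the function name makeXmlDeclaration is an assumption made for the example.

```cpp
#include <cassert>
#include <string>

// Build the XML declaration line. An empty encoding means that no
// encoding attribute is written, so readers fall back to UTF-8.
std::string makeXmlDeclaration(const std::string& encoding = "")
{
    std::string decl = "<?xml version=\"1.0\"";
    if (!encoding.empty())
        decl += " encoding=\"" + encoding + "\"";
    decl += "?>";
    return decl;
}
```

Calling makeXmlDeclaration("iso-8859-1") yields the header shown above, while makeXmlDeclaration() yields the plain declaration without an encoding attribute. Either way, the caller remains responsible for making the declared encoding match the bytes actually written.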

Status changed from Cancelled to Done
(deleted custom field) set to fixed

#17
Updated by Linus Atorf almost 14 years ago

Vadim, please excuse my delay.

Replying to [comment:15 vp153]:

one more question before closing the ticket. If no encoding is specified
[...]
is the file displayed correctly in Notepad++?

Yes, it is displayed correctly then. But it's just because of the automatic "encoding recognition" by Notepad++ I guess, which tries to read the encoding-tag.

I just checked and it works perfectly well on mac. If it improves
situation on Windows, I can remove encoding specification from XML.

Whether or not it improves the situation should depend on the app reading the XML file. According to the standard, there should be no difference (since the UTF-8 specifier could also be omitted). It seems that in the case where it's omitted, Notepad++ did some heuristics on character frequency to decide how to display the text to the user. However, I personally wouldn't say Notepad++ is the "target app" that needs to be supported. It probably just tries to display the text to the user as well as it can -- so it might do things differently than a certain XML parser app would do.

Replying to [comment:16 vp153]:

some more changes.

Great, the r5253 patch is probably the most sensible thing to do. It makes everything more flexible and lets the user take care (makes sense, as the user also has to worry about the encoding). I'll personally use the "ASCII" tag for my case now.

Thanks for fixing!

Linus
