mirror of
https://github.com/jmcnamara/libxlsxwriter.git
synced 2026-05-15 14:15:54 -06:00
[GH-ISSUE #238] Corrupted Excel files after inserting image with filename containing non-latin characters #190
Labels
No labels
awaiting user feedback
bug
cmake
cmake
docs
feature request
in progress
long term
medium term
medium term
pull-request
question
question
ready to close
short term
under investigation
wontfix
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: github-starred/libxlsxwriter#190
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @RaFaeL-NN on GitHub (Jul 19, 2019).
Original GitHub issue: https://github.com/jmcnamara/libxlsxwriter/issues/238
Originally assigned to: @jmcnamara on GitHub.
If I have an image with non-latin (russian) filename ("лого.png"), I'm using worksheet_insert_image(worksheet, 10, 15, "лого.png"). The library is using under Windows. I have to send a filename without UTF encoding, otherwise I got an error 9. I get "no error" with creation the xlsx file without encoding image filename, but the file is corrupted on opening in Excel (drawing1.xml). I think, it is because of copying original filename to tag "descr" without encoding to UTF-8 (may be there is an another problem, really I don't know). Corrupted file attached
Report.xlsx.
@jmcnamara commented on GitHub (Jul 19, 2019):
Can you add a small working example that demonstrates the issue with the sample image.
@RaFaeL-NN commented on GitHub (Jul 19, 2019):
OK
test.zip
@jmcnamara commented on GitHub (Jul 19, 2019):
Seems to be a Windows only issue. It works fine on Linux/Mac OS.
I'll look into a fix.
@m00k12 commented on GitHub (Aug 9, 2019):
I'm seeing something very similar, writing from Linux.
Passing
worksheet_write_string()a string that contains an invalid ASCII character causes Excel to fail to load the entire worksheet (LibreOffice dropped just the affected column of the worksheet). The example of this was a string containing a byte 0xf6, commonly interpreted as 'o diaeresis' or an 'o' with an umlaut where I am.The fix may be to check that the passed string is valid UTF-8 or pure ASCII and return an
lxw_errorcode and omit the cell so we get a chance of an error at generation time, rather than when loading into Excel.For reference Excel did produce a warning pointing to the invalid character in the input file, it was at column 549511 of a line over a million chars long and took a hex editor to find!
Great work though!
@jmcnamara commented on GitHub (Aug 9, 2019):
@m00k12
Technically it isn't quite the same on Linux.
In general the filename and any strings the you use in libxlsxwriter (or Excel) must be UTF-8.
The issue reported, as far as I can see, relates to reading a Unicode .png file with the zlib interfaces on Windows. I still haven't had a chance to validate it in MSVC but I tried it on Linux and it works as expected.
I think, for now, the user will just have to ensure that they are passing valid UTF-8 strings to the library.
@Alexhuszagh commented on GitHub (Sep 16, 2019):
@jmcnamara, @m00k12 It would be fairly easy to provide a check that a C-string is valid UTF-8.
A comprehensive implementation would be (this should work in C89 as long as the compiler has
stdint.h(non-conforming, but very common) and the header hasuint8_tanduintptr_t, both of which should be likely conditions. It should work in C99 or later as long asstdint.hhas both those types.When running this, I get the following results:
@Alexhuszagh commented on GitHub (Sep 16, 2019):
If the UTF-8 validator has any interest, I'd gladly submit a PR with it included. However, I believe this only addresses the second issue raised, and not the original bug on MSVC (which I will look into later).
@jmcnamara commented on GitHub (Sep 16, 2019):
@Alexhuszagh Thanks for the input, but I don't think that is a feature that I want to add/maintain.
@Alexhuszagh commented on GitHub (Sep 16, 2019):
@jmcnamara No worries, just figured I'd make the suggestion since I've previously implemented this code.
@m00k12 commented on GitHub (Sep 16, 2019):
I used libiconv to clean all strings before output.
@jmcnamara commented on GitHub (Oct 6, 2019):
I don't work on Windows very often so I keep failing to get to this to debug it. Maybe someone else on the thread can have a look.
A test program would look like this:
You can use any png image. The source file should be UTF-8 encoded.
There are two potential issue.
The first is that libxlsxwriter needs to open and parse the image added with
worksheet_insert_image()to read the image metadata such as height, width and DPI. The 'fopen()' on the filename to read the image metadata probably won't work for a file with an non-ASCII name:https://github.com/jmcnamara/libxlsxwriter/blob/master/src/worksheet.c#L5546
The second, potential issue, is when the file is added to the zip file:
https://github.com/jmcnamara/libxlsxwriter/blob/master/src/packager.c#L245
If someone who works on Windows more regularly, and knows how these types of filename handling works, can take a look that would be great.
+ @utelle
@RaFaeL-NN commented on GitHub (Oct 6, 2019):
fopen() in Windows works only with ANSI filenames.
The fopen function opens the file that is specified by filename. By default, a narrow filename string is interpreted using the ANSI codepage (CP_ACP)
Another solution is in converting ANSI filename with MultiByteToWideChar(0,... and WideCharToMultiByte(65001,... before putting it in XML, not before fopen()
Third variant is an option for non-copying filename to "descr" or an optional parameter "descr" in worksheet_insert_image() (with filename by default, if you like this way)
@jmcnamara commented on GitHub (Oct 6, 2019):
I've pushed a fix to the
win_fopenbranch that uses_wfopen()on Windows (203c22e973614ccce6a51b2f1f9d030d37af4363).I build that branch into a library and compile the above code against it. It worked as expected with a sample лого.png image file:
Note, the source file must be UTF-8 encoded.
Can you try it out and let me know.
@RaFaeL-NN commented on GitHub (Oct 6, 2019):
No difference. I am using MinGW, not MSVC. Is a patch applicable for it?
@RaFaeL-NN commented on GitHub (Oct 6, 2019):
P.S. It works with comments #ifdef _MSC_VER line and some lines at bottom, and replacing _MAX_PATH with 260
@jmcnamara commented on GitHub (Oct 6, 2019):
Ok. Good.
Can you try replace _MSC_VER like this, and test it:
@RaFaeL-NN commented on GitHub (Oct 6, 2019):
Don't work. If I undef _WIN32, how can it works?
It works with replacing "#ifdef _MSC_VER" with "#ifdef _WIN32" and _MAX_PATH with 260
@jmcnamara commented on GitHub (Oct 6, 2019):
What version of MinGW are you using? 32 or 64bit and can you send a link to the one you download/use?
@RaFaeL-NN commented on GitHub (Oct 6, 2019):
32bit. Really I don't know version, because I am using it just for making dll of libxlsxwriter and never updating it for years after installing ~3 years ago. You can try the latest one from mingw.org
P.S. For now, the minimal working variant is
@jmcnamara commented on GitHub (Oct 7, 2019):
I understand that this fix works in your case but I'm not convinced that it is a general issue (outside of MSVC).
The
fopen()function should work in MinGW32 without modification. Here is an example:Output:
Can you try reproduce in a more recent version of MinGW using the example above and the more general libxlsxwriter MinGW instructions here: https://libxlsxwriter.github.io/getting_started.html#gsg_ming
@RaFaeL-NN commented on GitHub (Oct 7, 2019):
No. A program, compiled in MinGW depends on msvcrt.dll from Windows. fopen() in msvcrt.dll use ANSI only
@jmcnamara commented on GitHub (Oct 7, 2019):
I don't understand. Why does the MinGW example I show above work with fopen and a non-ansii filename?
@RaFaeL-NN commented on GitHub (Oct 7, 2019):
What version of Windows and what version of msvcrt.dll? Do you understand where fopen() is stored?
@jmcnamara commented on GitHub (Oct 7, 2019):
Windows 10 and msvcrt.dll ver 7.0.16299.
@RaFaeL-NN commented on GitHub (Oct 7, 2019):
Can you attach an archive with utf8.exe and лого.png?
@jmcnamara commented on GitHub (Oct 7, 2019):
I can't attach them but you can rebuild them by following my instructions above, I used MSYS2 and the following setup:
@RaFaeL-NN commented on GitHub (Oct 7, 2019):
I have to check what your filename is in ANSI on disk (EB EE E3 EE 2E 70 6E 67), not in UTF-8 (D0 BB D0 BE D0 B3 D0 BE 2E 70 6E 67). How can you rename it to Win-1251 without russian locale?
@jmcnamara commented on GitHub (Oct 8, 2019):
It would be better if you built the environment with the instructions I gave above and tried it yourself.
Alternatively, you could give detailed steps on how I can replicate your environment.
Anyway, the filename is UTF-8 encoded (it has to be to match the ut8.c example encoding):
@RaFaeL-NN commented on GitHub (Oct 8, 2019):
Well. But the original issue is with filename in ANSI (EB EE E3 EE 2E 70 6E 67) (read first post). In Windows filenames are stored in ANSI, not in UTF-8. If filename is D0 BB D0 BE D0 B3 D0 BE 2E 70 6E 67 it will NOT be displaying at Windows Explorer (and all other programs) as "лого.png", but as "лого.png" (not readable). If I have file "лого.png" (D0 BB D0 BE D0 B3 D0 BE 2E 70 6E 67) of course, I can insert it by lib. But I have not, and nobody have. So, I need a way to insert image with filename "лого.png" in ANSI (EB EE E3 EE 2E 70 6E 67) and to have a correct xlsx on output. You can test it with file from test.zip in third post (you have to unzip it under Windows)
https://stackoverflow.com/questions/2050973/what-encoding-are-filenames-in-ntfs-stored-as
@Alexhuszagh commented on GitHub (Oct 8, 2019):
Iconv is great, the main issue is the licensing which is obviously not compatible here. Also, not worth a dependency, I'm guessing.
@Alexhuszagh commented on GitHub (Oct 8, 2019):
@jmcnamara Windows still uses various macros that may depend on your compiler configuration to determine whether
fopenuses narrow or wide paths.For higher-level APIs, this may be controlled via the
UNICODE(and_UNICODE) macros:https://docs.microsoft.com/en-us/windows/win32/learnwin32/working-with-strings#unicode-and-ansi-functions
For our use-case, this is controlled via parameters to
fopen:https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=vs-2019#remarks
I would recommend both of you ensure that you are working on the correct set of configurations, but likely, we should our file utilities work with UTF-8-encoded paths (which will almost certainly require conversion to UTF-16). For example, on WINE, using the following source code, I get
1for code compiled by both i686 and x86_64 MinGW, meaning it is using the ANSI encoding for filenames (which can be highly-variable depending on locale):@Alexhuszagh commented on GitHub (Oct 8, 2019):
A simple conversion utility (and an example showing it works) is as follows (please note this example must be compiled with a UTF-8 source encoding, which for MSVC is provided with the "/utf-8" flag, on MinGW, I believe this is the default):
This should be sufficient to fix the issue, along with any NULL-checks to ensure the path isn't NULL (which may happen due to invalid input, or memory errors) while writing to file.
@jmcnamara commented on GitHub (Oct 9, 2019):
@Alexhuszagh Thanks for that.
@jmcnamara commented on GitHub (Oct 9, 2019):
@RaFaeL-NN
Ok. That should be possible to support. I'll add a workaround soon.
@RaFaeL-NN commented on GitHub (Oct 30, 2019):
Good. Can you move description next to y_scale in lxw_image_options? As you know, I can not use С headers and I have to create my own data structures before sending data to lib. Of course, I can insert dummy empty variables but I hope, you'll change
to
@jmcnamara commented on GitHub (Nov 10, 2019):
Yes, I will. In fact I'm going to refactor the structs used in APIs so that all public fields are documented and available and the structs will be different internally if they contain any additional metadata that isn't public. I'll start with
lxw_image_optionsas a test.Once the
descriptionfield is available publicly you can set it to "" (blank string) and the description/filename won't be copied internally and won't corrupt the file if it isn't UTF8. That should allow you to specify any encoding you want for the program to read the image file.@jmcnamara commented on GitHub (Nov 17, 2019):
I'm going to close this issue. I'd recommend setting the description fields of
lxw_image_optionsto "" (blank string) to prevent any non-utf8 chars getting copied into the file and corrupting it.@RaFaeL-NN commented on GitHub (Nov 17, 2019):
I tested it (user 'description' field) and it works good. So, for now and for me the solution of the original issue is of copying the UTF8-encoded filename to description if user omit 'description' field. I think a commit with lxw_fopen actually does not needed
@jmcnamara commented on GitHub (Nov 20, 2019):
Yes. It wasn't needed for your case. It is still probably a worthwhile fix for people using MSVC though.