mirror of
https://github.com/jmcnamara/libxlsxwriter.git
synced 2026-05-15 14:15:54 -06:00
[GH-ISSUE #64] Corrupted Excel files due to locale dependent conversion of floating point values #55
Labels
No labels
awaiting user feedback
bug
cmake
cmake
docs
feature request
in progress
long term
medium term
medium term
pull-request
question
question
ready to close
short term
under investigation
wontfix
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: github-starred/libxlsxwriter#55
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @utelle on GitHub (Jul 8, 2016).
Original GitHub issue: https://github.com/jmcnamara/libxlsxwriter/issues/64
Originally assigned to: @jmcnamara on GitHub.
I develop an i18n application.
libxlsxwriterworks fine as long as the default C locale or a locale that uses decimal points as the delimiter for the fractional part of a floating point value are used.The problem is that the C functions used in
libxlsxwriter, namelyfprintf,lxw_snprintf, are not locale independent. That is, for example if a German locale is established, a decimal comma is used as the delimiter in formatting floating point numbers. However, Excel expects that floating point values are formatted with decimal points. Thus,libxlsxwritercreates corrupted Excel files.To overcome this problem, MSVC provides
_sprintf_lthat takes a locale argument, and MinGW providessprintf_l(with a slightly different function signature). However, I don't know whether a generic portable solution exists.@jmcnamara commented on GitHub (Jul 8, 2016):
Can you override the locale at the application level, or at least around the Excel writing part?
@evanmiller commented on GitHub (Jul 8, 2016):
@jmcnamara You should be able to save the current locale, change the locale to "C" or "POSIX", do your printing, then restore the original locale when done printing. E.g. something like
@utelle commented on GitHub (Jul 8, 2016):
How can I encourage you?
Well, in fact, it's my application that sets the locale - dependent on the choice of the user. That is, even on a German system it will use decimal points, if the user switches the application to English.
Overriding the locale around the Excel writing part is not such a good idea, since for C the locale is a global beast - changing it would for example affect also different threads ... with unpredictable consequences.
Since I had to make a release of my application (which uses
libxlsxwriterinternally) yesterday I implemented a quick hack forlibxlsxwriterby defining a function inutility.cthat converts a double to a string in a locale independent way. And that function is called from all places wherelibxlsxwriterneeds to format doubles. It works, but the solution is not generic.In general, I think the approach to separate the formatting of the doubles into its own function is the right approach. However, the implementation should work on all supported platforms.
For my Open Source project wxPdfDocument, where I also have to format doubles in a locale independent way for output to PDF, I wrote my own methods to handle this. However, in fact, it's a bit like re-inventing the wheel.
It might be easier to determine the decimal delimiter of the current locale, format the double using for example
sprintf, and replace the decimal delimiter by a decimal point afterwards.@utelle commented on GitHub (Jul 8, 2016):
@evanmiller: Changing the locale in that way you proposed can have unwanted side effects in a multi-threaded environment, since the locale is a global instance.
@evanmiller commented on GitHub (Jul 8, 2016):
@utelle You are right, my proposal will wreck the locale in a multi-threaded environment.
@evanmiller commented on GitHub (Jul 8, 2016):
What about
uselocale? It is a thread-specific alternative, and part of POSIX.1-2008.http://man7.org/linux/man-pages/man3/uselocale.3.html
@utelle commented on GitHub (Jul 8, 2016):
A function to convert a double to a string in a locale independent way could look like the following:
The passed
bufferhas to be large enough to hold the result of the conversion. So maybe the size of the buffer should be passed to the function, too, and thensnprintfresplxw_snprintshould be used instead ofsprintf.A local buffer should not be used to avoid problems in multi-threaded environments.
@utelle commented on GitHub (Jul 8, 2016):
@evanmiller
In principle, a good idea. Unfortunately, the MSVC runtime library does not provide the function
uselocale. You would have to take special measures to enable thread-specific locales. Not impossible, but cumbersome. So under Windows it takes more efforts to solve this issue.@evanmiller commented on GitHub (Jul 9, 2016):
@utelle This StackOverflow thread may be useful:
http://stackoverflow.com/questions/6561723/thread-specific-locale-manipulation-in-c
Win32 provides
_configthreadlocale(_ENABLE_PER_THREAD_LOCALE)which makessetlocalebehave likeuselocale.@utelle commented on GitHub (Jul 9, 2016):
The more I think about this the more I get the impression that manipulating the global or thread-specific locale might not be the best approach. At least not on Windows.
libxlsxwriteris a library, and a library should not interfere with application specific settings. Using_configthreadlocale(_ENABLE_PER_THREAD_LOCALE)withinlibxlsxwritermight not be what the application wants.IMHO the way to do it is to use functions like
sprintf_lor to mimic their behaviour.My code sample posted here might be a too simple approach, since there might exist locales where the decimal separator is not a single character. So one would have to replace the corresponding string, instead of just 1 character. Additionally, the active locale might request thousands separators - just to complicate matters.
@jmcnamara commented on GitHub (Jul 9, 2016):
I don't think locale changes should be handled in the library either. My current thoughts on how to handle this are:
setlocale()anduselocale()(thanks @evanmiller).fprintf()calls, like all thesnprintfcalls, and allow Windows users to override them with the_lalternatives. Perhaps with some sample functions embedded in the code."%g"infprintf()so that a modification tosnprintf()is sufficient. The directfprintf()handling was added in5173f40ac7 (diff-20827109fa4f6c0bac299b5c9184f20fL2179)as a performance optimization. This could be#definedcode so that the optimization is still available.@utelle could you investigate if replacing
lxw_snprintf()on Windows with a_l()variant will work. You can replace the code inwrite_number_cell()with the code from the diff above in number 4.@utelle commented on GitHub (Jul 9, 2016):
Please don't. For developers like me support for i18n applications is important, and it would be a big advantage for
libxlsxwriterto support i18n out of the box. I will certainly help to find solutions.For i18n applications it would make life difficult for developers. One would always have to keep in mind that locales have to be changed before and reverted after doing Excel export. I would certainly prefer a solution that hides this difficulty from the developer using
libxlsxwriter.And IMHO it is not that difficult: encapsulate the formatting of doubles into a separate function and provide implementations that work for most if not all platforms and compilers. I've done it already, but the solution has to be extended to support at least all major platforms.
As already said above: just encapsulate the double formatting into a function that returns the formatted string. Then the rest of the code can stay as it is by just using
%sinstead of%.16g.See above.
The
_lvariant works definitely for MSVC. However, as mentioned earlier MinGW defines a different variant, and other compilers might not have it at all. So I will do some more digging to find a solution that works in a more or less portable way.@utelle commented on GitHub (Jul 10, 2016):
I did some research yesterday. Since
sprintf_lor variants thereof don't seem to be generally available (MSVC has it, BSD also, but not Linux), and since speed can also be an issue for converting doubles to strings, I would recommend to use a private version ofdtoa.I found this interesting project, dtoa benchmark, which includes a C implementation. The benchmark claims that the latter is about 9 times faster than
sprintf. The license is very permissive.I haven't tested it myself yet, but I would give it a try. That is, encapsulate the formatting of doubles into a separate function, which then uses the
emyg_dtoaimplementation to perform the actual conversion.@jmcnamara commented on GitHub (Jul 10, 2016):
After thinking about his for a bit I am now leaning towards handling this in a sprintf wrapper like your code above. I'll also have a look at the dtoa code.
I'm going to tackle the tempfile issue first and then move on to this.
Thanks for the research. That is very helpful.
@utelle commented on GitHub (Jul 10, 2016):
Conversion from double to string happens 7 times in your code (1 x
chart.c, 1 xcustom.c, 1 xxmlwriter.c, 4 xworksheet.c). Except for 2 cases inworksheet.cwhere you usefprintfwith surrounding text elements in the format string, it is always just a plain conversion from doube to string. That is, it should be easy to isolate and encapsulate the code for the conversion.Personally, I would probably go for the
dtoacode, although plain replacement of commas by points should work in 99.99% of all cases. Either point or comma are used as decimal separators all over the world, except for a few East Arabic countries where another character is used (see Wikipedia: Decimal mark).Solving this issue would certainly give
libxlsxwritera competitive edge - at least for developers of i18n applications. Most Excel export libraries I came across suffer from the very same problem.@utelle commented on GitHub (Jul 11, 2016):
In the meantime I tested
emyg_dtoain my local version. I had to struggle a bit to get it compiled with MSVC 2010, since the code uses C99 features that are not supported by MSVC, but finally I managed to get it working. However, I wonder whether theemyg_dtoacode was ever compiled successfully with MSVC as is, since the code seems to contain items that are not even known to MSVC 2015. I will check with MSVC 2015 later on, too.As far as I can tell
libxlsxwriterfunctions properly using that code.I defined a function
lxw_print_doubleinutility.cwith the following signatureand replaced
lxw_snprintfbylxw_print_doubleat the places where the format code%.16gwas used to convert double to string (except for the calls tofprintfwhere I changed the format code to%sand replaced the double argument by a call tolxw_print_double).@BongoVR commented on GitHub (Aug 2, 2017):
@utelle @jmcnamara
Is the libxlsxwriter code compiled into the executable?
This problem does not seem to arise if the libxlsxwriter code is compiled into a regular DLL (aka USRDLL) statically linking to the C runtime on Windows which is not loaded at startup (compiler switches
/D "_USRDLL"and/MTfor Microsoft Visual Studio).My scenario is pretty much the same as the OP's. The application is calling
setlocale(LC_ALL, "")during initialisation - whatever that may mean for a particular user.However, the libxlsxwriter DLL is consumed as a so-called "delay loaded DLL" since I am using the code only for file export (
/DELAYLOADlinker option for the exe in Visual Studio). The DLL is thus not loaded until the user actually needs the DLL. Since its DllMain function lacks a call to setlocale, the DLL is happily using the "C" locale. Therefore, I never experienced that problem.I myself am running the application using a German locale and xlsx creation worked fine so far. No thread-unsafe calls to
setlocalenor tinkering with user-defined special functions likelxw_print_doublefor formatting floating point values have been necessary.Let me emphasise that either the delay-load feature or the static runtime linking alone might well do the trick.
@jmcnamara commented on GitHub (Aug 2, 2017):
@BongoVR thanks for the information.
@utelle commented on GitHub (Aug 6, 2017):
Using the static runtime library of MSVC in a DLL can cause a lot of problems, if not used very carefully, especially since the DLL will use its own memory management. Besides that this "solution" is MSVC specific.
Therefore I would prefer a generic solution that generates correct output for variables of type double independent of the locale.
As mentioned in my post of July 11, 2016, I adjusted the libxlsxwriter code to use a locale independent function for generating the string output for doubles. And finally I dedicated some time to apply the changes not only in my local development environment for my own application using libxlsxwriter.
I committed the changes to my libxlsxwriter clone - after syncing it with the current version of the original. I struggled a bit with adjusting the various Makefiles, but now the code compiles at least without errors. However, the CI on Travis fails in the step verifying the generated test files. @jmcnamara: it would be nice if you could take a look - maybe I just made a simple stupid mistake; thanks in advance.
@jmcnamara commented on GitHub (Aug 8, 2017):
@utelle I will probably implement a macro for
snprintf()with"%.16g"so that the user can override it with their own function. I'm not in favour of adding dtoa at this time however.@utelle commented on GitHub (Aug 8, 2017):
Any specific reason for not using dtoa?
All what I can tell is that I use it in a commercial application and it works flawlessly. That is, the resulting files can be opened by Excel without any problems.
IMHO you should at least isolate the output of double values into a separate function as I did, so that a change is necessary at a single place only.
@utelle commented on GitHub (Aug 8, 2017):
I still think that
libxlsxwritershould provide a locale independent solution for writing double values right out of the box instead of leaving it to the user of the library to build some sort of workaround (by using compiler specific functions like_sprintf_l, by isolating the library into a separate DLL as proposed by @BongoVR, or by any other means).@utelle commented on GitHub (Aug 8, 2017):
I modified the
dtoacode to not emit a decimal point and a trailing zero for doubles representing integer numbers ... and now CI on Travis passes all tests.@jmcnamara commented on GitHub (Aug 9, 2017):
I looked at the emyg_dtoa.c implementation last year when you posted it and again today. I couldn't get it to compile without modification and it doesn't have any significant test suite. I see that you also had to patch it to get it to work. So for me it is a non-runner.
Of the others in the benchmark the Gay algorithm should be stable and tested, and it compiles everywhere but it is slower than
sprintf(). Although that wouldn't be my main consideration.Also
fpconvcompiles okay, runs faster thansprintfand seems to be maintained, so that may be an option.Either way I would have to write a test suite before I added any of them.
Yes. I will do that as step 1 in the near future.
As above my main concern is finding a library that I can stand over. When/if I identify a suitable library I'll add it as a step 2.
@utelle commented on GitHub (Aug 9, 2017):
Unfortunately you are right that the dtoa benchmark project seems not to be the best source for a
dtoaimplementation. Yes, I had to tweakemyg_dtoato make it compile, but as far as I can tell the algorithm works properly.If you tend to use the
dtoaimplementation of David Gay, then do not use the sources from the dtoa benchmark project - they are outdated. Get the latest version from netlib.fpconvmight be preferrable to Gay'sdtoa, since it is a lot faster. In fact, it is based on the same algorithm (grisu) - originally developed by Florian Loitsch - asemyg_dtoa.That you intend to do step 1 (isolating output of doubles) in the near future sounds really good. Then one can drop in whatever
dtoaimplementation without too much hassle.@m1h43l commented on GitHub (Mar 12, 2018):
Hi,
I just want to add my 2 cents. I am using this library on IBM i with the EBCDIC CCSID 273 (German) and the locale de_DE. This setup configures the decimal point as a comma which results in an invalid xlsx document. This library uses snprintf for transforming the double value to a character value. snprintf does not use the DECFMT job value but takes the LC_NUMERIC value into account which can be set as an env var for the job.
So users with a comma as a decimal point sign can change the env var LC_NUMERIC to get a correct document.
@utelle commented on GitHub (Mar 12, 2018):
@m1h43l: for your use case the solution to set the decimal separator via environment variable may work sufficiently well. However, in the general case this solution has the side effect of changing the decimal separator globally for the system/user. This may not be acceptable for some user interface applications.
IMHO the preferred solution should be to make the library locale independent. In fact, this is not a big deal as I demonstrated in my own fork. In my own applications my approach works flawlessly ever since.
@m1h43l commented on GitHub (Mar 13, 2018):
@utelle I totally agree. The application should work without having the user to modify the environment. But for the time being it is better to have something than to have nothing and thus the workaround with the env var.
@jmcnamara commented on GitHub (Apr 25, 2018):
I've implemented a partial solution for this: I've added a function called lxw_sprintf_dbl() which can be enabled via the
USE_DOUBLE_FUNCTION#define:By default (when
USE_DOUBLE_FUNCTIONis not defined)lxw_sprintf_dbl()is just a macro.See commit
14421b6629for details.The internals of the
lxw_sprintf_dbl()function can be replaced with something more complete such as theemyg_dtoafunction you use.I'll look into adding something like dtoa as an optional compile at a later stage.
@utelle commented on GitHub (Apr 25, 2018):
Great news. A step in the right direction towards a general solution of the problem.
However, I really don't understand why you chose to use a macro if
USE_DOUBLE_FUNCTIONis not defined. IMHO that makes the code less clear (for example inworksheet.c).The variant of
emyg_dtoaI use in my fork works flawlessly producing the exact same results as your original library.@jmcnamara commented on GitHub (Apr 25, 2018):
For performance reasons in
_write_number_cell()function which is in the inner loop. Although, having said that, I haven't performance tested both variants yet and an inlined function might be as fast as the macro. I'll look into that in the next few days.@utelle commented on GitHub (Apr 25, 2018):
I doubt that one level of indirection really causes performance problems compared to the time required to do the actual double formatting. At the moment your macro approach looks like premature optimization to me.
BTW, I updated my fork to your latest version, but using my
emyg_dtoabased implementation for your new macro/functionlxw_sprintf_dbl... it still passes all tests. 😄@hdijkema commented on GitHub (Aug 15, 2019):
Please document the 'USE_DOUBLE_FUNCTION' somewhere in CMakefile.txt or in some building instructions. I have this problem on a Linux system (not on Windows) and I had to look into the generated excel XML to see what caused the problems. Also it doesn't seem to work with cmake? I did:
Eventually I added
'#define USE_DOUBLE_FUNCTION 1'incommon.hduring library compilation and removed it when installing...@petricf commented on GitHub (Sep 19, 2019):
Can you integrate the following patch:
https://bugs.gentoo.org/attachment.cgi?id=590372
For reference the complete bug report is at: https://bugs.gentoo.org/627452
This adds the cmake option USE_DOUBLE_FUNCTION=(ON/OFF). Allows to be defined at cmake call time.
Useful for systems which build automatically from source (like gentoo).
@jmcnamara commented on GitHub (Sep 20, 2019):
@petricf Will do. Thanks.
@utelle commented on GitHub (Jan 20, 2020):
Instead of continuing the discussion in #265 (under a misleading title) I'd like to add some information here.
Outside of libxlsxwriter I compared the implementations
emyg_dtoaandfpconvfrom dtoa-benchmark. Thefpconvimplementation is identical to night-shift/fpconv. Not sure, whichfpconvimplementation you had in mind.emyg_dtoaas well asfpconvare based on grisu by Florian Loitsch.Interestingly the results of
emyg_dtoaandfpconvdiffered in roughly 43 % of 100.000 test conversions, although the results varied only in the last digit of the output. It should be tested whether this would have an effect on the results of the libxlsxwriter test suite.Additionally, I found ryu by Ulf Adams with implementation(s) here.
@jmcnamara commented on GitHub (Jul 11, 2021):
Closing this as a duplicate of #272