mirror of
https://github.com/jmcnamara/libxlsxwriter.git
synced 2026-05-15 14:15:54 -06:00
[GH-ISSUE #221] constant_memory extension for maximum-sized shared-string table? #178
Labels
No labels
awaiting user feedback
bug
cmake
cmake
docs
feature request
in progress
long term
medium term
medium term
pull-request
question
question
ready to close
short term
under investigation
wontfix
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: github-starred/libxlsxwriter#178
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @mewalig on GitHub (Apr 4, 2019).
Original GitHub issue: https://github.com/jmcnamara/libxlsxwriter/issues/221
Originally assigned to: @jmcnamara on GitHub.
I understand that in constant_memory mode, the shared string table is unavailable because you have to keep it in memory for lookups, and you can't know ahead of time how big it might need to be.
I have to create large XLSX files that have a LOT of repetitive strings. Using constant memory is a huge benefit, but the resulting XLSX files are unnecessarily very large (one might think that the zip process would work well on repetitive strings in XML... but it doesn't). Since the number of strings is relatively small, a shared string table probably would not require much memory, and would be hugely beneficial in creating optimized XLSX results.
So, I'm wondering if it would be feasible to provide an additional memory option, which would be to have a fixed-sized-memory string table. The sst is used if an existing match is found, or if memory is available, and would fall back to inline strings if not.
So it would look like this:
Would you see anything in the above approach that I'm missing to make this feature possible? If not, and I get it working, would you consider adding it to the master branch?
@mewalig commented on GitHub (Apr 5, 2019):
Well I went ahead and did this. Added
sst_memorytolxw_workbook_optionsto define the maximum memory that should be used for storing the SST before it starts using inline strings. Below is example usage. Attached is a patch. I'm hoping you will find it useful, in which case, would you be willing to merge (assume this code passes any necessary qc tests-- and if it doesn't pls feel free to lmk and I'll fix...)constant_memory_with_sst.patch.txt
@jmcnamara commented on GitHub (Apr 5, 2019):
How large? And how long do they take to create?
Yes. I was surprised by that as well. Have you tried increasing the zlib compression level? It will add to the processing time but the files may be significantly smaller. You could test this outside the C library by unzipping one of your files and re-zipping it with system zip and different compression levels.
I've considered, in the past, adding a SST + single row hybrid of the two existing modes. The main motivation was that some applications, in particular Apple Numbers, didn't support the inline-string element. It would also allow the creation of files that were the same as files created by Excel, which is a primary goal of the library. However, recent versions of Numbers now support inline-strings so that is no longer an motivation and it has been a long time since anyone reported a third-party app issue with inline strings.
If you implemented something like that would your memory usage/file size tradeoff by acceptable?
Unfortunately no.
The hybrid model described above is as far as I would be willing to go in terms of complexity and maintainability. But even then there haven't been any other requests for a feature like that.
Libxlsxwriter is a port of the Python module XlsxWriter which in turn is a port of the Perl module Excel::Writer::XLSX, all of which I wrote and maintain. So a new feature needs to be added to the Perl versions first, then to the Python version, and then to the C version. I don't intend to add any non-language specific feature that isn't in all three versions. And, as I said, there currently isn't anyone else asking for this feature.
Nevertheless thank you for your input. Your patch may be useful to someone else with a similar usecase.
John
@mewalig commented on GitHub (Apr 5, 2019):
OK, did not realize that about the Python and Perl versions. Actually I have more use for the Python version, I just didn't realize until now that it had the constant_memory mode. Do you automate the Perl -> Python -> C conversion or maintain all three code bases by hand? I'm asking because I may have use down the road for both a Python and a C version of this feature and I'd be interested to learn how you maintain to weigh the work it might take if I were to do so in a separate fork.
Re your earlier questions, large would be, hundreds of thousands of rows, hundreds of columns, and maybe half a dozen worksheets each with that much data. The saved worksheets are about 100MB. I don't have figures on different zip compressions at the moment-- I would have to first convert saved-with-SST to saved-without-SST and don't have a tool on hand for that (though I'm guessing it's close to a 1-liner with this library).
If for no other reason than that one of your stated goals is the creation of files that are the same as files created by Excel, I hope that you might in the future reconsider adding this feature. The main benefit as I see it is that it enables the user to get as close as possible to being able to create files that are the same as files created by Excel-- that is, it allows the user to create the same files when possible, and when not possible, to create a different, but equivalent file-- without the risk of running out of memory mid-task.
Anyway, thank you for your feedback, and more importantly, for all your good work on these libraries.
Lastly, I've noticed that in some comments, you (or perhaps another maintainer) has suggested that the maintainers might be willing to consider prioritization of certain features if, for lack of a better term, a bounty is offered. If that's potentially the case here, please let me know.
@jmcnamara commented on GitHub (Apr 5, 2019):
I automate the conversion of functional test cases (the ones that compare output against actual Excel files) from Perl to Python/C. I also automate the creation of stub functions and unit testcases based on XML elements. After that it is all artisanally hand crafted using only the finest of code. :-)
If you want to try then you can change the
Z_DEFAULT_COMPRESSIONtoZ_BEST_COMPRESSIONin packager.c: https://github.com/jmcnamara/libxlsxwriter/blob/master/src/packager.c#L1067I do accept bounties/donations (although I rarely get any) but mainly for pre-existing features that are due to be ported. The $ numbers on the relevant Feature Requests are mainly to remind people that features take effort and just hitting +1 and expecting it to be done isn't enough. However for this feature I'm not convinced of the general usefulness so I'd rather not take it on. Sorry.
@mewalig commented on GitHub (Apr 5, 2019):
Got it. Thank you