[GH-ISSUE #221] constant_memory extension for maximum-sized shared-string table? #178

Closed
opened 2026-05-05 11:50:32 -06:00 by gitea-mirror · 5 comments
Owner

Originally created by @mewalig on GitHub (Apr 4, 2019).
Original GitHub issue: https://github.com/jmcnamara/libxlsxwriter/issues/221

Originally assigned to: @jmcnamara on GitHub.

I understand that in constant_memory mode, the shared string table is unavailable because you have to keep it in memory for lookups, and you can't know ahead of time how big it might need to be.

I have to create large XLSX files that have a LOT of repetitive strings. Using constant memory is a huge benefit, but the resulting XLSX files are unnecessarily very large (one might think that the zip process would work well on repetitive strings in XML... but it doesn't). Since the number of strings is relatively small, a shared string table probably would not require much memory, and would be hugely beneficial in creating optimized XLSX results.

So, I'm wondering if it would be feasible to provide an additional memory option, which would be to have a fixed-sized-memory string table. The sst is used if an existing match is found, or if memory is available, and would fall back to inline strings if not.

So it would look like this:

  • workbook options include a constant_sst_memory option specifying the maximum size of the SST
  • worksheet_write_string() always uses shared strings-- but it's possible that shared string max mem is zero, which basically means it always uses inline strings
  • lxw_get_sst_index() is modified to:
    • change RB_INSERT to RB_FIND
    • If found, use the result
    • If not found, then, if it has enough mem for the new value, then:
      • Create the RB element and insert it using RB_INSERT (just as it already does)
      • update sst->memory_used to include the size of the RB
    • If not found, and the sst doesn't have enough mem for the new value, then just use an inline string

Would you see anything in the above approach that I'm missing to make this feature possible? If not, and I get it working, would you consider adding it to the master branch?

Originally created by @mewalig on GitHub (Apr 4, 2019). Original GitHub issue: https://github.com/jmcnamara/libxlsxwriter/issues/221 Originally assigned to: @jmcnamara on GitHub. I understand that in constant_memory mode, the shared string table is unavailable because you have to keep it in memory for lookups, and you can't know ahead of time how big it might need to be. I have to create large XLSX files that have a LOT of repetitive strings. Using constant memory is a huge benefit, but the resulting XLSX files are unnecessarily very large (one might think that the zip process would work well on repetitive strings in XML... but it doesn't). Since the number of strings is relatively small, a shared string table probably would not require much memory, and would be hugely beneficial in creating optimized XLSX results. So, I'm wondering if it would be feasible to provide an additional memory option, which would be to have a **fixed-sized-memory string table**. The sst is used if an existing match is found, or if memory is available, and would fall back to inline strings if not. So it would look like this: - workbook options include a constant_sst_memory option specifying the maximum size of the SST - worksheet_write_string() always uses shared strings-- but it's possible that shared string max mem is zero, which basically means it always uses inline strings - lxw_get_sst_index() is modified to: - change RB_INSERT to RB_FIND - If found, use the result - If not found, then, if it has enough mem for the new value, then: - Create the RB element and insert it using RB_INSERT (just as it already does) - update sst->memory_used to include the size of the RB - If not found, and the sst doesn't have enough mem for the new value, then just use an inline string Would you see anything in the above approach that I'm missing to make this feature possible? If not, and I get it working, would you consider adding it to the master branch?
gitea-mirror 2026-05-05 11:50:32 -06:00
  • closed this issue
  • added the
    question
    label
Author
Owner

@mewalig commented on GitHub (Apr 5, 2019):

Well I went ahead and did this. Added sst_memory to lxw_workbook_options to define the maximum memory that should be used for storing the SST before it starts using inline strings. Below is example usage. Attached is a patch. I'm hoping you will find it useful, in which case, would you be willing to merge (assume this code passes any necessary qc tests-- and if it doesn't pls feel free to lmk and I'll fix...)

lxw_workbook_options options = {.constant_memory = LXW_TRUE,
                                .sst_memory = 16384,
                                .tmpdir = NULL};

constant_memory_with_sst.patch.txt

<!-- gh-comment-id:480161854 --> @mewalig commented on GitHub (Apr 5, 2019): Well I went ahead and did this. Added ```sst_memory``` to ```lxw_workbook_options``` to define the maximum memory that should be used for storing the SST before it starts using inline strings. Below is example usage. Attached is a patch. I'm hoping you will find it useful, in which case, would you be willing to merge (assume this code passes any necessary qc tests-- and if it doesn't pls feel free to lmk and I'll fix...) lxw_workbook_options options = {.constant_memory = LXW_TRUE, .sst_memory = 16384, .tmpdir = NULL}; [constant_memory_with_sst.patch.txt](https://github.com/jmcnamara/libxlsxwriter/files/3046587/constant_memory_with_sst.patch.txt)
Author
Owner

@jmcnamara commented on GitHub (Apr 5, 2019):

I have to create large XLSX files that have a LOT of repetitive strings. Using constant memory is a huge benefit, but the resulting XLSX files are unnecessarily very large

How large? And how long do they take to create?

one might think that the zip process would work well on repetitive strings in XML... but it doesn't

Yes. I was surprised by that as well. Have you tried increasing the zlib compression level? It will add to the processing time but the files may be significantly smaller. You could test this outside the C library by unzipping one of your files and re-zipping it with system zip and different compression levels.

Since the number of strings is relatively small, a shared string table probably would not require much memory, and would be hugely beneficial in creating optimized XLSX results.

I've considered, in the past, adding a SST + single row hybrid of the two existing modes. The main motivation was that some applications, in particular Apple Numbers, didn't support the inline-string element. It would also allow the creation of files that were the same as files created by Excel, which is a primary goal of the library. However, recent versions of Numbers now support inline-strings so that is no longer an motivation and it has been a long time since anyone reported a third-party app issue with inline strings.

If you implemented something like that would your memory usage/file size tradeoff by acceptable?

would you consider adding it to the master branch

Unfortunately no.

The hybrid model described above is as far as I would be willing to go in terms of complexity and maintainability. But even then there haven't been any other requests for a feature like that.

Libxlsxwriter is a port of the Python module XlsxWriter which in turn is a port of the Perl module Excel::Writer::XLSX, all of which I wrote and maintain. So a new feature needs to be added to the Perl versions first, then to the Python version, and then to the C version. I don't intend to add any non-language specific feature that isn't in all three versions. And, as I said, there currently isn't anyone else asking for this feature.

Nevertheless thank you for your input. Your patch may be useful to someone else with a similar usecase.

John

<!-- gh-comment-id:480198295 --> @jmcnamara commented on GitHub (Apr 5, 2019): > I have to create large XLSX files that have a LOT of repetitive strings. Using constant memory is a huge benefit, but the resulting XLSX files are unnecessarily very large How large? And how long do they take to create? > one might think that the zip process would work well on repetitive strings in XML... but it doesn't Yes. I was surprised by that as well. Have you tried increasing the zlib compression level? It will add to the processing time but the files may be significantly smaller. You could test this outside the C library by unzipping one of your files and re-zipping it with system zip and different compression levels. > Since the number of strings is relatively small, a shared string table probably would not require much memory, and would be hugely beneficial in creating optimized XLSX results. I've considered, in the past, adding a SST + single row hybrid of the two existing modes. The main motivation was that some applications, in particular Apple Numbers, didn't support the inline-string element. It would also allow the creation of files that were the same as files created by Excel, which is a primary goal of the library. However, recent versions of Numbers now support inline-strings so that is no longer an motivation and it has been a long time since anyone reported a third-party app issue with inline strings. If you implemented something like that would your memory usage/file size tradeoff by acceptable? > would you consider adding it to the master branch Unfortunately no. The hybrid model described above is as far as I would be willing to go in terms of complexity and maintainability. But even then there haven't been any other requests for a feature like that. Libxlsxwriter is a port of the Python module [XlsxWriter ](https://github.com/jmcnamara/XlsxWriter)which in turn is a port of the Perl module [Excel::Writer::XLSX](https://github.com/jmcnamara/excel-writer-xlsx), all of which I wrote and maintain. So a new feature needs to be added to the Perl versions first, then to the Python version, and then to the C version. I don't intend to add any non-language specific feature that isn't in all three versions. And, as I said, there currently isn't anyone else asking for this feature. Nevertheless thank you for your input. Your patch may be useful to someone else with a similar usecase. John
Author
Owner

@mewalig commented on GitHub (Apr 5, 2019):

OK, did not realize that about the Python and Perl versions. Actually I have more use for the Python version, I just didn't realize until now that it had the constant_memory mode. Do you automate the Perl -> Python -> C conversion or maintain all three code bases by hand? I'm asking because I may have use down the road for both a Python and a C version of this feature and I'd be interested to learn how you maintain to weigh the work it might take if I were to do so in a separate fork.

Re your earlier questions, large would be, hundreds of thousands of rows, hundreds of columns, and maybe half a dozen worksheets each with that much data. The saved worksheets are about 100MB. I don't have figures on different zip compressions at the moment-- I would have to first convert saved-with-SST to saved-without-SST and don't have a tool on hand for that (though I'm guessing it's close to a 1-liner with this library).

If for no other reason than that one of your stated goals is the creation of files that are the same as files created by Excel, I hope that you might in the future reconsider adding this feature. The main benefit as I see it is that it enables the user to get as close as possible to being able to create files that are the same as files created by Excel-- that is, it allows the user to create the same files when possible, and when not possible, to create a different, but equivalent file-- without the risk of running out of memory mid-task.

Anyway, thank you for your feedback, and more importantly, for all your good work on these libraries.

Lastly, I've noticed that in some comments, you (or perhaps another maintainer) has suggested that the maintainers might be willing to consider prioritization of certain features if, for lack of a better term, a bounty is offered. If that's potentially the case here, please let me know.

<!-- gh-comment-id:480280263 --> @mewalig commented on GitHub (Apr 5, 2019): OK, did not realize that about the Python and Perl versions. Actually I have more use for the Python version, I just didn't realize until now that it had the constant_memory mode. Do you automate the Perl -> Python -> C conversion or maintain all three code bases by hand? I'm asking because I may have use down the road for both a Python and a C version of this feature and I'd be interested to learn how you maintain to weigh the work it might take if I were to do so in a separate fork. Re your earlier questions, large would be, hundreds of thousands of rows, hundreds of columns, and maybe half a dozen worksheets each with that much data. The saved worksheets are about 100MB. I don't have figures on different zip compressions at the moment-- I would have to first convert saved-with-SST to saved-without-SST and don't have a tool on hand for that (though I'm guessing it's close to a 1-liner with this library). If for no other reason than that one of your stated goals is the creation of files that are the same as files created by Excel, I hope that you might in the future reconsider adding this feature. The main benefit as I see it is that it enables the user to get _as close as possible_ to being able to create files that are the same as files created by Excel-- that is, it allows the user to create the same files when possible, and when not possible, to create a different, but equivalent file-- without the risk of running out of memory mid-task. Anyway, thank you for your feedback, and more importantly, for all your good work on these libraries. Lastly, I've noticed that in some comments, you (or perhaps another maintainer) has suggested that the maintainers might be willing to consider prioritization of certain features if, for lack of a better term, a bounty is offered. If that's potentially the case here, please let me know.
Author
Owner

@jmcnamara commented on GitHub (Apr 5, 2019):

Do you automate the Perl -> Python -> C conversion or maintain all three code bases by hand?

I automate the conversion of functional test cases (the ones that compare output against actual Excel files) from Perl to Python/C. I also automate the creation of stub functions and unit testcases based on XML elements. After that it is all artisanally hand crafted using only the finest of code. :-)

I don't have figures on different zip compressions at the moment-- I would have to first convert saved-with-SST to saved-without-SST and don't have a tool on hand for that (though I'm guessing it's close to a 1-liner with this library).

If you want to try then you can change the Z_DEFAULT_COMPRESSION to Z_BEST_COMPRESSION in packager.c: https://github.com/jmcnamara/libxlsxwriter/blob/master/src/packager.c#L1067

maintainers might be willing to consider prioritization of certain features if, for lack of a better term, a bounty is offered. If that's potentially the case here, please let me know.

I do accept bounties/donations (although I rarely get any) but mainly for pre-existing features that are due to be ported. The $ numbers on the relevant Feature Requests are mainly to remind people that features take effort and just hitting +1 and expecting it to be done isn't enough. However for this feature I'm not convinced of the general usefulness so I'd rather not take it on. Sorry.

<!-- gh-comment-id:480304407 --> @jmcnamara commented on GitHub (Apr 5, 2019): > Do you automate the Perl -> Python -> C conversion or maintain all three code bases by hand? I automate the conversion of functional test cases (the ones that compare output against actual Excel files) from Perl to Python/C. I also automate the creation of stub functions and unit testcases based on XML elements. After that it is all artisanally hand crafted using only the finest of code. :-) > I don't have figures on different zip compressions at the moment-- I would have to first convert saved-with-SST to saved-without-SST and don't have a tool on hand for that (though I'm guessing it's close to a 1-liner with this library). If you want to try then you can change the `Z_DEFAULT_COMPRESSION` to `Z_BEST_COMPRESSION` in packager.c: https://github.com/jmcnamara/libxlsxwriter/blob/master/src/packager.c#L1067 > maintainers might be willing to consider prioritization of certain features if, for lack of a better term, a bounty is offered. If that's potentially the case here, please let me know. I do accept bounties/donations (although I rarely get any) but mainly for pre-existing features that are due to be ported. The $ numbers on the relevant Feature Requests are mainly to remind people that features take effort and just hitting +1 and expecting it to be done isn't enough. However for this feature I'm not convinced of the general usefulness so I'd rather not take it on. Sorry.
Author
Owner

@mewalig commented on GitHub (Apr 5, 2019):

Got it. Thank you

<!-- gh-comment-id:480390200 --> @mewalig commented on GitHub (Apr 5, 2019): Got it. Thank you
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: github-starred/libxlsxwriter#178
No description provided.