mirror of
https://github.com/jmcnamara/libxlsxwriter.git
synced 2026-05-15 14:15:54 -06:00
[GH-ISSUE #260] External MD5 Library #205
Labels
No labels
awaiting user feedback
bug
cmake
cmake
docs
feature request
in progress
long term
medium term
medium term
pull-request
question
question
ready to close
short term
under investigation
wontfix
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: github-starred/libxlsxwriter#205
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @squinky86 on GitHub (Jan 2, 2020).
Original GitHub issue: https://github.com/jmcnamara/libxlsxwriter/issues/260
Originally assigned to: @jmcnamara on GitHub.
With the latest libxlsxwriter (0.9.1), my software has a linking error when I try to link it against openssl and libxlsxwriter:
This is because openssl defines the MD5 functions that libxlsxwriter defines. I can fix this for this particular software by adding the "-DUSE_NO_MD5=ON" cmake flag, but then my software does not take advantage of the duplicate-image-detection optimizations of the latest libxlsxwriter release.
Suggested fix: libxlsxwriter should allow the external linking of a library that provides the md5 hash functions (similar to -DZLIB_LIBRARY=) or should rename its provided md5 functions to prevent a collision.
@jmcnamara commented on GitHub (Jan 2, 2020):
Ack. I should have thought of this. The MD5 lib I included uses OpenSSL compatible MD5 functions.
I don't know which is the best approach. Renaming the MD5 functions is the simplest. Linking against OpenSSL would remove duplication but will complicate the build systems and prerequisites.
I'll need to think about this before I make the change.
@jmcnamara commented on GitHub (Jan 2, 2020):
@Alexhuszagh, @evanmiller Any thoughts on this issue?
@evanmiller commented on GitHub (Jan 2, 2020):
Should something in the build be defining
HAVE_OPENSSL?1290420818/third_party/md5/md5.c (L38)@jmcnamara commented on GitHub (Jan 6, 2020):
@evanmiller I don't think that is the issue. I think this is probably a link rather than build issue.
@squinky86 I can't reproduce this. I tried the following:
Do you have an example to reproduce it.
I'm currently leaning towards renaming the MD5 functions to avoid a conflict since I don't think changing the build systems to link against openssl/libcrypto is worth it. However, I'd link to be able to reproduce this first.
@jmcnamara commented on GitHub (Jan 6, 2020):
I had another go at reproducing this with a simple openssl based example:
This will fail complation since we don't link the MD5 functions.
Then link the MD5 functions from libcrypto and libxlsxwriter:
Again, I can't reproduce your error. I'm not saying that it doesn't exist, just that I can't reproduce it.
@Alexhuszagh commented on GitHub (Jan 6, 2020):
@jmcnamara I think the issue only occurs when links to libxlsxwriter and to OpenSSL, that is, both OpenSSL and libxlsxwriter will build fine, but downstream projects using both may have issues.
I currently don't have a test computer to work with (mine's in the repair shop, and I'm trying to find a way to get a Unix-like environment at the library), but I think this may be a decently trivial fix, one of 2 ways:
Most Compact Fix
If we want to minimize the binary size, ideally we'd avoiding duplicating any functionality whenever possible. To do this, we'd have to make the build system aware of OpenSSL, so we'd have:
1). An option to use MD5 or not (
USE_MD5).2). An option to use OpenSSL or not for all cryptographic functions (
USE_OPENSSL).If it uses OpenSSL, then it links to the OpenSSL MD5 functions, otherwise, it will link to the imported symbols (optionally renamed), and use certain macros to ensure the proper symbol names are used.
This, however, increases build system complexity, which may not be desired. If the symbols are not renamed, and the user accidentally uses both OpenSSL and libsxlwriter without making the build system aware of it, it may become the source of a lot of frustration and spurious bug reports.
Make OpenSSL a Dependency
Another solution would be to make OpenSSL a dependency if the MD5 option is used. I understand there are other reasons why this is not desired.
Simple Option
Just rename all the symbols. This will always work, with no issues, but will lead to slightly larger binary sizes if both OpenSSL and libxlsxwriter are both used in a project.
@squinky86 commented on GitHub (Jan 6, 2020):
@Alexhuszagh is correct - this only occurs in downstream projects that link to both libxlsxwriter and openssl due to the name conflict. I'm about to create a pull request for the Simple Option - it makes the binary no bigger than it would be if the md5 functions that are in the third party folder were built-in. Though I do believe the OpenSSL dependency is more elegant.
@squinky86 commented on GitHub (Jan 6, 2020):
Another option: zlib provides a crc32 and adler32 which could be used instead of md5 without introducing additional overhead, since zlib is already a dependency.
@Alexhuszagh commented on GitHub (Jan 6, 2020):
@squinky86 I'm slightly worried about the number of bits possible with CRC32 and Adler32, both of which produce a 32-bit hash. MD5 has its own issues, but if we're worried about hash collisions, then we should probably prefer a more modern hash function, and preferably with more bits.
If it's purely accidental, the chance of collision being >= 50% is ~77,000 images with Adler32 or CRC32, while with MD5 it's ~2.17e19 images. A lot, but still, the chance of accidentally removing an image and replacing it with a false duplicate is noticeably non-zero with random data with a 32-bit checksum, but practically 0 for any 128-bit hash. There is a 1% collision chance with ADLER32 and CRC32 at ~9200 images. With enough users and enough data, this will eventually occur.
This is using the fast approximation of the birthday problem, where the approximate number of values required to reach a collision is approximately
math.sqrt(-2*n*math.log(1-p, math.e))(using Python code), wherenis the number of possible hash values, andpis the probability of a collision.I understand these probabilities are very low, and that pretty much any client chokes on larger number of images, but I'd like to mitigate any potential chance of a hash collision if possible.
@RaFaeL-NN commented on GitHub (Jan 6, 2020):
What if generate both 32-bit checksums and check their comparison? It will be something like 64-bit
@squinky86 commented on GitHub (Jan 6, 2020):
@jmcnamara: I was able to reproduce by compiling a static openssl (./Configure mingw64 no-shared no-asm no-err no-tests CROSS_COMPILE=x86_64-w64-mingw32-) and static libxlsxwriter (cmake ./ -DUSE_STANDARD_TMPFILE=OFF -DCMAKE_TOOLCHAIN_FILE=../cross-compile.cmake -DBUILD_SHARED_LIBS=OFF -DZLIB_INCLUDE_DIR=/home/squinky86/w64/zlib-1.2.11 -DZLIB_LIBRARY=/home/squinky86/w64/zlib-1.2.11/libzlibstatic.a) then link the hello world binary (https://libxlsxwriter.github.io/hello_8c-example.html) against both the libcrypto and libxlsxwriter libraries.
@Alexhuszagh: SHA/BLAKE2 make more sense as the "correct" solutions. MD5 is an acceptable solution, but you can always create a use case where there's a collision (https://www.hacksandsecurity.org/posts/two-images-have-same-md5-hash-md5-collision-example). Adler and CRC combination makes sense as the most efficient solution without increasing static binary size.
@Alexhuszagh commented on GitHub (Jan 6, 2020):
@RaFaeL-NN That's not truly an acceptable solution, because it would assume both checksums are independent, which I'm not sure they are. In fact, I'm highly dubious they are.
@squinky86 Yeah, ideally we'd use SHA-2 or SHA-3, but MD5 works as long as we're assuming no malicious data (malicious data would then reduce the collision chance to 2^24, which is even worse than the checksum). Anyway, it's up to @jmcnamara to ultimately decide what hash algorithm to use. Ideally, I'd be a proponent of SHA-2 or SHA-3, which can also be found in OpenSSL (or elsewhere). Blake would work too, for sure, but that may be overkill.
@jmcnamara commented on GitHub (Jan 7, 2020):
Thanks everyone for the input. It was a good discussion.
I'm going to avoid getting involved with OpenSSL/libcrypto and just rename the embedded MD5 functions, which I should probably have done from the start.
@jmcnamara commented on GitHub (Jan 7, 2020):
I've push a fix for this to master. @squinky86 can you try it when you get a chance.
@jmcnamara commented on GitHub (Jan 9, 2020):
Fixed upstream. Closing.