| 6 Apr 2025 |
Fernando Rodrigues | * Hi Keanu, welcome to Nixpkgs! libpinyin's issues don't come from runtime errors like memory allocation or uninitialised structs, but from the generation of the pinyin indexes. You're probably going to have to dig into how those indexes are generated by libpinyin. | 01:20:10 |
guider-le-recit | Thanks for pointing me in the right direction, Fernando. My mistake: I initially thought the issue was struct padding because of the binary differences shown in diffoscope. After looking at the upstream issue and examining the diffoscope output more carefully, I noticed the differences are in database files. I found the gen_binary_files utility in the build logs that seems responsible for generating these files. How does this utility process and sort the input data before writing to the database? | 03:40:43 |
Fernando Rodrigues | Good question. That's what we (and apparently also upstream) need you to explore. Dig into the utility's source code, and try to learn its inner workings. If you find something promising, collect your findings and share them in the issue. | 03:50:44 |
guider-le-recit | That sounds like fun, thank you Fernando. I will start exploring and hopefully get back to you soon. | 03:58:37 |
| 7 Apr 2025 |
guider-le-recit | Hi Fernando, I was wrong, this wasn't fun at all. I've analyzed the load_text methods for the B-Tree tables (pinyin_index, phrase_index, etc.) and they seem to process input sequentially without obvious sources of non-deterministic insertion order.
I also traced bigram.db generation to import_interpolation using the Bigram class, confirming it uses DB_HASH. While the Bigram::store method looks deterministic given its SingleGram input, could the non-reproducibility of bigram.db stem from Berkeley DB's default DB_HASH implementation itself being sensitive to the build environment? Is this a known pattern, and are there ways to make BDB Hash generation reproducible? | 14:19:32 |
raboof | I don't know, but I do remember diffoscope has support for Berkeley DB files, so that is at least a promising sign that people who care about reproducibility have been looking at those :) | 15:20:48 |
raboof | though "Format-specific differences are supported for Berkeley DB database files but no file-specific differences were detected" in this case so not very helpful :) | 15:37:32 |
guider-le-recit | The smiley faces help | 15:57:06 |
guider-le-recit | So instead we are working with BDB's internal metadata, not application-level logic, right? | 15:57:30 |
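
One way to probe the question above is to check where the bytes of two independently built copies of bigram.db actually differ. The following is a minimal sketch, not part of libpinyin or diffoscope: the helper names `differing_pages` and `diff_offsets` are hypothetical, and the 4096-byte page size is an assumption (the real page size is recorded in the database file itself and has not been checked here). If the differences cluster in a handful of pages or at the same offsets in every page, that would point at storage-level layout rather than at the records the application wrote.

```python
def differing_pages(a: bytes, b: bytes, page_size: int = 4096) -> list[int]:
    """Return the indexes of fixed-size pages whose bytes differ between a and b.

    page_size is an assumption; pass the real value if it is known.
    """
    n_pages = (max(len(a), len(b)) + page_size - 1) // page_size
    return [
        i
        for i in range(n_pages)
        if a[i * page_size:(i + 1) * page_size] != b[i * page_size:(i + 1) * page_size]
    ]


def diff_offsets(a: bytes, b: bytes, page_index: int, page_size: int = 4096) -> list[int]:
    """Return the byte offsets inside one page at which a and b differ."""
    start = page_index * page_size
    page_a = a[start:start + page_size]
    page_b = b[start:start + page_size]
    length = max(len(page_a), len(page_b))
    # Slicing one byte at a time keeps the comparison safe when one page is shorter.
    return [k for k in range(length) if page_a[k:k + 1] != page_b[k:k + 1]]
```

To use it on real artifacts, read both builds with `pathlib.Path(...).read_bytes()` (the paths depend on the local build outputs) and print `differing_pages(first, second)`, then call `diff_offsets` on each reported page.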