| 12 Apr 2025 |
Fernando Rodrigues | In reply to @guider-le-recit:matrix.org Apologies for the delayed update on this. In essence, I modified the libpinyin source (ngram_bdb.cpp, chewing_large_table2_bdb.cpp, phrase_large_table3_bdb.cpp, punct_table_bdb.cpp) to add a DB->set_flags(handle, DB_TXN_NOT_DURABLE) call immediately after db_create() and before DB->open() for all database handles used during index generation. However, the build failed during the make process when running gen_binary_files. The build log showed: "BDB1566 DB_NOT_DURABLE interface requires an environment configured for the transaction subsystem"
Given that DB_TXN_NOT_DURABLE did not work, I reverted those changes and went back to testing the difference between access methods. I modified only ngram_bdb.cpp to change the bigram.db file type from DB_HASH to DB_BTREE in all its DB->open() calls. The build completed this time, but the reproducibility check (--check) still failed. Running diffoscope showed that while bigram.db is now reported as a B-Tree file, it exhibits the exact same header difference pattern around offset 0x34 as all the other B-Tree files.
I did not want to try refactoring the code to use a DB_ENV, so I looked for any flags I had missed that could handle this. There aren't any, but I found that the problematic region starting at 0x34 corresponds to a 20-byte uid field within the common DBMETA structure (https://github.com/zvelo/BerkeleyDB/blob/master/src/dbinc/db_page.h). This uid is an apparently inherently non-deterministic unique file identifier generated by BDB during database creation, influenced by runtime factors potentially including ASLR. The BDB docs confirm there are no API flags controllable via DB->set_flags() on standalone handles to suppress or stabilize this uid generation.
So it seems that the non-reproducibility affecting all generated BDB files stems directly from this volatile uid field. At this point I am tired and not sure what to do next * | 20:10:39 |
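One way to double-check the finding above is to compare two builds of the same .db file with the uid bytes blanked out. The following is only a minimal sketch: it takes the offset (0x34) and length (20 bytes) from the message above, looks only at the first metadata page of each file, and the helper names `mask_uid`, `differs_only_in_uid` and `compare_files` are made up for illustration. It diagnoses the difference; it does not make the build reproducible.

```python
from pathlib import Path

# Values taken from the message above: the uid field starts at 0x34
# and is 20 bytes long. Only the first metadata page is considered.
UID_OFFSET = 0x34
UID_LENGTH = 20


def mask_uid(data: bytes) -> bytes:
    """Return a copy of the file contents with the uid bytes zeroed."""
    end = UID_OFFSET + UID_LENGTH
    if len(data) < end:
        # Too short to contain the uid field; leave it untouched.
        return data
    return data[:UID_OFFSET] + bytes(UID_LENGTH) + data[end:]


def differs_only_in_uid(a: bytes, b: bytes) -> bool:
    """True when a and b differ, but are identical once the uid is masked."""
    return a != b and mask_uid(a) == mask_uid(b)


def compare_files(path_a: str, path_b: str) -> bool:
    """Hypothetical convenience wrapper: compare two files on disk."""
    return differs_only_in_uid(Path(path_a).read_bytes(),
                               Path(path_b).read_bytes())
```

Calling `compare_files()` on the bigram.db from two separate builds and getting `True` would support the theory that the uid field is the only source of the difference. `False` would mean the files are either identical or differ somewhere else as well, for example in a second metadata page.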
guider-le-recit | Thank you Fernando | 20:11:28 |
guider-le-recit | do i make a new issue or post message onto the original? | 20:11:50 |
guider-le-recit | * do i make a new issue or post the message onto the original? | 20:11:59 |
Fernando Rodrigues | I think it's best to make a new issue, since this affects more than just libpinyin. | 20:12:49 |
guider-le-recit | Okay, thank you once more | 20:13:31 |
emily | fantastic great work! | 21:35:55 |
emily | FWIW, BerkeleyDB was abandoned by Oracle. I know there are various forks and API-compatible replacements | 21:36:53 |
emily | maybe one of them avoids this issue? | 21:36:54 |
emily | it also might be an option to move packages off BerkeleyDB to alternative backends like GNU dbm where supported: https://fedoraproject.org/wiki/User:Pkubat/Draft_-_Removing_BerkeleyDB_from_Fedora | 21:37:48 |
emily | for libpinyin, libpinyin X GPLv3+ depends on KyotoCabinet since f24 | 21:41:20 |
emily | though I'm not sure if Kyoto Cabinet is maintained either 😆 | 21:42:12 |
emily | ah, https://dbmx.net/kyotocabinet/ points to https://dbmx.net/tkrzw/. | 21:42:35 |
emily | but https://github.com/libpinyin/libpinyin/blob/a6f4d3c239883b5e1dd0770ab2b433042845e9c9/configure.ac hardcodes only support for Berkeley DB and Kyoto Cabinet. | 21:43:04 |
emily | the latest Kyoto Cabinet is still like three years newer than the latest Berkeley DB, and there's a good chance it doesn't have this specific reproducibility bug, so… it may be a good option for libpinyin :) | 21:46:43 |
| 13 Apr 2025 |
| Bot_wxt1221 joined the room. | 13:32:05 |
guider-le-recit | Hi miss Emily, you are absolutely correct, I edited the package.nix file, removed Berkeley DB from buildInputs and replaced it with kyotocabinet, added a list configureFlags = [ "--with-dbm=KyotoCabinet" ];, and now the build completes no more derivation errors | 13:32:30 |
guider-le-recit | * Hi miss Emily, you are absolutely correct, I edited the package.nix file, removed Berkeley DB from buildInputs and replaced it with kyotocabinet, added a list configureFlags = [ "--with-dbm=KyotoCabinet" ];, and now the build completes with no more derivation errors | 13:32:51 |
guider-le-recit | Thank you | 13:33:13 |
emily | nice! | 13:33:45 |
guider-le-recit | I'm gonna make the github issue now and send the link so you can push your solution | 13:33:55 |
emily | solving Berkeley DB reproducibility issues would still be valuable in general though, since it's likely there's software that doesn't support anything else :) | 13:34:18 |
emily | I know Debian and Fedora have been working on getting rid of it for years, but I don't know if they've fully achieved that | 13:35:08 |
emily | documenting your great deep dive into the internals will definitely be valuable | 13:36:17 |
guider-le-recit | Are you aware of any links that i can read up on that? | 13:37:21 |
guider-le-recit | okay | 13:37:28 |
emily | https://fedoraproject.org/wiki/User:Pkubat/Draft_-_Removing_BerkeleyDB_from_Fedora is an old table from Fedora and https://lists.debian.org/debian-devel/2014/06/msg00338.html is an email from a decade-old mailing list thread in Debian talking about alternatives like LMDB | 13:39:34 |
emily | Debian still has a BDB package to this day though: https://packages.debian.org/source/sid/db5.3 | 13:39:39 |
emily | so I assume they didn't completely get rid of it :) | 13:40:24 |
guider-le-recit | How did you get that so fast? | 13:40:44 |