@susadmin

susadmin@lemmy.world · edited 8 days ago

The files in this (early) dataset 12 are identical to those in the dataset 12 here, which is the link in the OP. The MD5 hashes match.

I shared a .csv file of the calculated MD5 hashes here.
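For anyone who wants to reproduce a hash list like that, here is a minimal sketch of one way to do it. It is not the exact script used for the shared file: the function names, the CSV column names, and the choice to record paths relative to the dataset folder are all assumptions made for illustration.

```python
import csv
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_csv(root, csv_path):
    """Write one (relative_path, md5) row per file under root; return the row count.

    Hypothetical helper: the column names are an assumption, not the layout
    of the .csv mentioned above.
    """
    root = Path(root)
    rows = 0
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["relative_path", "md5"])
        # Sort so two runs over the same folder produce comparable files.
        for path in sorted(p for p in root.rglob("*") if p.is_file()):
            writer.writerow([path.relative_to(root).as_posix(), md5_of_file(path)])
            rows += 1
    return rows
```

Write the CSV somewhere outside the dataset folder, otherwise the half-written CSV gets hashed along with everything else.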

susadmin@lemmy.world · 8 days ago

rsync --checksum is better than my file name + file size comparison, since it calculates the hash of each file and compares it to the hash of the file at the same path in the other copy. For example, if there is a file called data1.pdf with size 1024 bytes in dataset9-v1, and another file called data1.pdf with size 1024 bytes in dataset9-v2, but their contents differ, my method will still wrongly treat them as identical files.

I’m going to modify my script to calculate and compare the hashes of all files that I previously determined to be duplicates. If the hashes of the duplicates in dataset9 (45GB torrent) match the hashes of the duplicates in dataset9 (86GB torrent), then they are in fact duplicates between the two datasets.
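Roughly what I have in mind, as a sketch rather than the finished script: index both folders by file name + size, then hash only the pairs that match on both. The function names are made up for this example, and matching on the path relative to each dataset folder (rather than the bare file name) is an assumption.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_by_name_and_size(root):
    """Map (relative path, size in bytes) -> file path for every file under root."""
    root = Path(root)
    return {
        (p.relative_to(root).as_posix(), p.stat().st_size): p
        for p in root.rglob("*")
        if p.is_file()
    }


def verify_duplicates(root_a, root_b):
    """Split name+size matches into hash-confirmed duplicates and false positives.

    Returns (confirmed, mismatched) as lists of relative paths.
    """
    index_a = index_by_name_and_size(root_a)
    index_b = index_by_name_and_size(root_b)
    confirmed, mismatched = [], []
    # Only files that match on both name and size are candidates worth hashing.
    for key in sorted(index_a.keys() & index_b.keys()):
        same = md5_of_file(index_a[key]) == md5_of_file(index_b[key])
        (confirmed if same else mismatched).append(key[0])
    return confirmed, mismatched
```

Anything that lands in the mismatched list is exactly the data1.pdf case: same name, same size, different content.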

susadmin@lemmy.world · 8 days ago

archive.org is a great idea. Post the link here when you can!

susadmin@lemmy.world · 8 days ago

Yes! I’m not sure of the best way to do that - upload them to MEGA and message me a download link?

susadmin@lemmy.world · 8 days ago

I’m in the process of downloading both dataset 9 torrents (45.63 GB + 86.74 GB). I will then compare the filenames in both versions (the 45.63 GB version alone has 201,358 files), note any duplicates, and merge all unique files into one folder. I’ll upload that as a torrent once it’s done so we can get closer to a complete dataset 9 as one file.
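The merge step could look something like the sketch below. This is an illustration of the plan, not the script itself: `merge_unique` is a hypothetical name, and it compares relative paths only, so a "duplicate" here means "same name already copied", with no check on content.

```python
import shutil
from pathlib import Path


def merge_unique(sources, dest):
    """Copy files from each source folder into dest, skipping names already present.

    Returns (copied, duplicates) as lists of relative paths. Earlier sources
    win when the same relative path appears in more than one source.
    """
    dest = Path(dest)
    copied, duplicates = [], []
    for source in map(Path, sources):
        for path in sorted(p for p in source.rglob("*") if p.is_file()):
            rel = path.relative_to(source)
            target = dest / rel
            if target.exists():
                # Same name was copied from an earlier source; content is not compared.
                duplicates.append(rel.as_posix())
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 also preserves timestamps
            copied.append(rel.as_posix())
    return copied, duplicates
```

Usage would be along the lines of `merge_unique(["dataset9-45GB", "dataset9-86GB"], "dataset9-merged")`, where the folder names are placeholders. The destination should sit outside both source folders.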