
Coming up with that right now, check my comment below.
I made this script and am slowly about to finish the crawl; it's close to 20k+ pages.
It should be done in less than 1-2 hours, and I will upload it to Archive.org.


Superb, I have datasets 1-8 and 11-12.
Only dataset 10 remains to complete (downloading from Archive.org now).
Dataset 9 is the biggest. I ended up writing a parser to go through every page on justice.gov and make an index list.
Current estimate of the files list is:
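The index parser described above could look roughly like this. This is a sketch under assumptions: the `href` pattern, the file extensions, and the `crawl_index` helper are my guesses, not the actual script's logic or justice.gov's real markup.

```python
import re

# Assumed link pattern for document files on an index page; the real
# justice.gov markup may differ.
FILE_LINK = re.compile(r'href="(/[^"]+\.(?:pdf|mp4|jpg|png))"', re.IGNORECASE)

def extract_file_urls(html: str, base: str = "https://www.justice.gov") -> list[str]:
    """Pull absolute file URLs out of one index page's HTML."""
    return [base + path for path in FILE_LINK.findall(html)]

def crawl_index(fetch, last_page: int) -> list[str]:
    """Walk pages 0..last_page using a caller-supplied fetch(page_no) -> html,
    collecting every file URL into one index list."""
    urls: list[str] = []
    for page in range(last_page + 1):
        urls.extend(extract_file_urls(fetch(page)))
    return urls
```

Passing `fetch` in as a parameter keeps the page-walking logic testable without hitting the live site.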
Your merged 45GB + 86GB torrents (~500K-700K files) would be a huge help. Happy to cross-reference with my scraped URL list to find any gaps.
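The cross-referencing step is essentially a set difference between the scraped URL list and the filenames present in the torrents. A minimal sketch, assuming matching on basenames is good enough (checksums would be more robust, since different files can share a name):

```python
def find_gaps(scraped_urls: list[str], torrent_files: list[str]) -> list[str]:
    """Return scraped URLs whose basename is missing from the torrent
    file listing, i.e. files still needing download."""
    have = {path.rsplit("/", 1)[-1] for path in torrent_files}
    return sorted(u for u in scraped_urls if u.rsplit("/", 1)[-1] not in have)
```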
Hey, sorry, I got super distracted building a data mapper, but I have the version here. Justice.gov stopped responding to my requests, even though I was requesting the pages quite gracefully:
UPDATE DATASET 9 Files List:
Progress:
Scraped 529,334 file URLs from justice.gov (pages 0-18333, ~89% of the index)
link: https://archive.org/details/epstein-dataset9-index
The link is live and shows the 75.7MB JSON file available for download.
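When a server stops responding mid-crawl, the usual cause is rate limiting, so "graceful" requesting typically means pacing plus exponential backoff on failures. A sketch of that retry pattern, assuming a `get(url) -> (status, body)` callable (the actual script's approach may differ):

```python
import time

def fetch_with_backoff(get, url, retries=5, base_delay=1.0, sleep=time.sleep):
    """Fetch a URL, backing off exponentially (1s, 2s, 4s, ...) whenever the
    server refuses, instead of hammering it until it blocks the crawler."""
    for attempt in range(retries):
        status, body = get(url)
        if status == 200:
            return body
        sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up on {url} after {retries} attempts")
```

Injecting `sleep` makes the backoff schedule verifiable without real delays; honoring a `Retry-After` header, when the server sends one, would be politer still.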