
Coming up with that right now, check my comment below.
I made this script and am slowly about to finish the crawl; it's close to 20k+ pages.
It should be done in less than 1-2 hours, and I will upload it to Archive.org.


Superb, I have datasets 1-8 and 11-12.
Only dataset 10 remains to complete (downloading from Archive.org now).
Dataset 9 is the biggest. I ended up writing a parser to go through every page on justice.gov and make an index list.
Current estimate of the files list is:
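The index parser described above could look roughly like this. This is a sketch under assumptions: the `href` pattern, the file extensions, and the `crawl_index` helper are my guesses, not the actual script's logic or justice.gov's real markup.

```python
import re

# Assumed link pattern for document files on an index page; the real
# justice.gov markup may differ.
FILE_LINK = re.compile(r'href="(/[^"]+\.(?:pdf|mp4|jpg|png))"', re.IGNORECASE)

def extract_file_urls(html: str, base: str = "https://www.justice.gov") -> list[str]:
    """Pull absolute file URLs out of one index page's HTML."""
    return [base + path for path in FILE_LINK.findall(html)]

def crawl_index(fetch, last_page: int) -> list[str]:
    """Walk pages 0..last_page using a caller-supplied fetch(page_no) -> html,
    collecting every file URL into one index list."""
    urls: list[str] = []
    for page in range(last_page + 1):
        urls.extend(extract_file_urls(fetch(page)))
    return urls
```

Passing `fetch` in as a parameter keeps the page-walking logic testable without hitting the live site.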
Your merged 45GB + 86GB torrents (~500K-700K files) would be a huge help. Happy to cross-reference with my scraped URL list to find any gaps.
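The cross-referencing step is essentially a set difference between the scraped URL list and the filenames present in the torrents. A minimal sketch, assuming matching on basenames is good enough (checksums would be more robust, since different files can share a name):

```python
def find_gaps(scraped_urls: list[str], torrent_files: list[str]) -> list[str]:
    """Return scraped URLs whose basename is missing from the torrent
    file listing, i.e. files still needing download."""
    have = {path.rsplit("/", 1)[-1] for path in torrent_files}
    return sorted(u for u in scraped_urls if u.rsplit("/", 1)[-1] not in have)
```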
Hey, sorry, I got super distracted building a data mapper, but I have the version here. Justice.gov stopped responding to my requests, even though I was requesting the pages quite gracefully:
UPDATE DATASET 9 Files List:
Progress:
Scraped 529,334 file URLs from justice.gov (pages 0-18333, ~89% of the index)
link: https://archive.org/details/epstein-dataset9-index
The link is live and shows the 75.7MB JSON file available for download.
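When a server stops responding mid-crawl, the usual cause is rate limiting, so "graceful" requesting typically means pacing plus exponential backoff on failures. A sketch of that retry pattern, assuming a `get(url) -> (status, body)` callable (the actual script's approach may differ):

```python
import time

def fetch_with_backoff(get, url, retries=5, base_delay=1.0, sleep=time.sleep):
    """Fetch a URL, backing off exponentially (1s, 2s, 4s, ...) whenever the
    server refuses, instead of hammering it until it blocks the crawler."""
    for attempt in range(retries):
        status, body = get(url)
        if status == 200:
            return body
        sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up on {url} after {retries} attempts")
```

Injecting `sleep` makes the backoff schedule verifiable without real delays; honoring a `Retry-After` header, when the server sends one, would be politer still.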