Epstein Files Jan 30, 2026

Data hoarders on reddit have been hard at work archiving the latest Epstein Files release from the U.S. Department of Justice. Below is a compilation of their work with download links.

Please seed all torrent files to distribute and preserve this data.

Ref: https://old.reddit.com/r/DataHoarder/comments/1qrk3qk/epstein_files_datasets_9_10_11_300_gb_lets_keep/

Epstein Files Data Sets 1-8: INTERNET ARCHIVE LINK

Epstein Files Data Set 1 (2.47 GB): TORRENT MAGNET LINK
Epstein Files Data Set 2 (631.6 MB): TORRENT MAGNET LINK
Epstein Files Data Set 3 (599.4 MB): TORRENT MAGNET LINK
Epstein Files Data Set 4 (358.4 MB): TORRENT MAGNET LINK
Epstein Files Data Set 5 (61.5 MB): TORRENT MAGNET LINK
Epstein Files Data Set 6 (53.0 MB): TORRENT MAGNET LINK
Epstein Files Data Set 7 (98.2 MB): TORRENT MAGNET LINK
Epstein Files Data Set 8 (10.67 GB): TORRENT MAGNET LINK


Epstein Files Data Set 9 (Incomplete). Only contains 49 GB of 180 GB. Multiple reports of cutoff from DOJ server at offset 48995762176.

ORIGINAL JUSTICE DEPARTMENT LINK

  • TORRENT MAGNET LINK (removed due to reports of CSAM)

/u/susadmin’s More Complete Data Set 9 (96.25 GB)
De-duplicated merger of (45.63 GB + 86.74 GB) versions

  • TORRENT MAGNET LINK (removed due to reports of CSAM)

Epstein Files Data Set 10 (78.64 GB)

ORIGINAL JUSTICE DEPARTMENT LINK

  • TORRENT MAGNET LINK (removed due to reports of CSAM)
  • INTERNET ARCHIVE FOLDER (removed due to reports of CSAM)
  • INTERNET ARCHIVE DIRECT LINK (removed due to reports of CSAM)

Epstein Files Data Set 11 (25.55 GB)

ORIGINAL JUSTICE DEPARTMENT LINK

SHA1: 574950c0f86765e897268834ac6ef38b370cad2a


Epstein Files Data Set 12 (114.1 MB)

ORIGINAL JUSTICE DEPARTMENT LINK

SHA1: 20f804ab55687c957fd249cd0d417d5fe7438281
MD5: b1206186332bb1af021e86d68468f9fe
SHA256: b5314b7efca98e25d8b35e4b7fac3ebb3ca2e6cfd0937aa2300ca8b71543bbe2


This list will be edited as more data becomes available, particularly with regard to Data Set 9 (EDIT: NOT ANYMORE)


EDIT [2026-02-02]: After being made aware of potential CSAM in the original Data Set 9 releases and seeing confirmation in the New York Times, I will no longer support any effort to maintain links to archives of it. There is suspicion of CSAM in Data Set 10 as well. I am removing links to both archives.

Some in this thread may be upset by this action. It is right to be distrustful of a government that has not shown signs of integrity. However, I do trust journalists who hold the government accountable.

I am abandoning this project and removing any links to content that commenters here and on reddit have suggested may contain CSAM.

Ref 1: https://www.nytimes.com/2026/02/01/us/nude-photos-epstein-files.html
Ref 2: https://www.404media.co/doj-released-unredacted-nude-images-in-epstein-files

  • Arthas@lemmy.world · 6 days ago

    Epstein Files - Complete Dataset Audit Report

    Generated: 2026-02-16 | Scope: Datasets 1–12 (VOL00001–VOL00012) | Total Size: ~220 GB


    Background

    The Epstein Files consist of 12 datasets of court-released documents, each containing PDF files identified by EFTA document IDs. These datasets were collected from links shared throughout this Lemmy thread, with Dataset 9 cross-referenced against a partial copy we had downloaded independently.

    Each dataset includes OPT/DAT index files — the Opticon image and Concordance data load files used in e-discovery — which serve as the authoritative manifest of what each dataset should contain. This audit was compiled to:

    1. Verify completeness — compare every dataset against its OPT index to identify missing files
    2. Validate file integrity — confirm that all files are genuinely the file types they claim to be, not just by extension but by parsing their internal structure
    3. Detect duplicates — identify any byte-identical files within or across datasets
    4. Generate checksums — produce SHA256 hashes for every file to enable downstream integrity verification

    Executive Summary

    Metric                              Value
    Total Unique Files                  1,380,939
    Total Document IDs (OPT)            2,731,789
    Missing Files                       25 (Dataset 9 only)
    Corrupt PDFs                        3 (Dataset 9 only)
    Duplicates (intra + cross-dataset)  0
    Mislabeled Files                    0
    Overall Completeness                99.998%

    Dataset Overview

                          EPSTEIN FILES - DATASET SUMMARY
      ┌─────────┬──────────┬───────────┬──────────┬─────────┬─────────┬─────────┐
      │ Dataset │  Volume  │   Files   │ Expected │ Missing │ Corrupt │  Size   │
      ├─────────┼──────────┼───────────┼──────────┼─────────┼─────────┼─────────┤
      │    1    │ VOL00001 │     3,158 │    3,158 │       0 │       0 │  2.5 GB │
      │    2    │ VOL00002 │       574 │      574 │       0 │       0 │  633 MB │
      │    3    │ VOL00003 │        67 │       67 │       0 │       0 │  600 MB │
      │    4    │ VOL00004 │       152 │      152 │       0 │       0 │  359 MB │
      │    5    │ VOL00005 │       120 │      120 │       0 │       0 │   62 MB │
      │    6    │ VOL00006 │        13 │       13 │       0 │       0 │   53 MB │
      │    7    │ VOL00007 │        17 │       17 │       0 │       0 │   98 MB │
      │    8    │ VOL00008 │    10,595 │   10,595 │       0 │       0 │   11 GB │
      │    9    │ VOL00009 │   531,282 │  531,307 │      25 │       3 │   96 GB │
      │   10    │ VOL00010 │   503,154 │  503,154 │       0 │       0 │   82 GB │
      │   11    │ VOL00011 │   331,655 │  331,655 │       0 │       0 │   27 GB │
      │   12    │ VOL00012 │       152 │      152 │       0 │       0 │  120 MB │
      ├─────────┼──────────┼───────────┼──────────┼─────────┼─────────┼─────────┤
      │  TOTAL  │          │ 1,380,939 │1,380,964 │      25 │       3 │ ~220 GB │
      └─────────┴──────────┴───────────┴──────────┴─────────┴─────────┴─────────┘
    

    Notes

    • DS1: Two identical copies found (6,316 files on disk). Byte-for-byte identical via SHA256. Table above reflects one copy (3,158). One copy is redundant.
    • DS2: 699 document IDs map to 574 files (multi-page PDFs)
    • DS3: 1,847 document IDs across 67 files (~28 pages/doc avg)
    • DS5: 1:1 document-to-file ratio (single-page PDFs)
    • DS6: Smallest dataset by file count. ~37 pages/doc avg.
    • DS9: Largest dataset. 25 missing from OPT index, 3 structurally corrupt.
    • DS10: Second largest. 950,101 document IDs across 503,154 files.
    • DS11: Third largest. 517,382 document IDs across 331,655 files.

    Dataset 9 — Missing Files (25)
    EFTA00709804    EFTA00823221    EFTA00932520
    EFTA00709805    EFTA00823319    EFTA00932521
    EFTA00709806    EFTA00877475    EFTA00932522
    EFTA00709807    EFTA00892252    EFTA00932523
    EFTA00770595    EFTA00901740    EFTA00984666
    EFTA00774768    EFTA00912980    EFTA00984668
    EFTA00823190    EFTA00919433    EFTA01135215
    EFTA00823191    EFTA00919434    EFTA01135708
    EFTA00823192
    
    Dataset 9 — Corrupted Files (3)
    File              Size    Error
    EFTA00645624.pdf  35 KB   Missing trailer dictionary, broken xref table
    EFTA01175426.pdf  827 KB  Invalid xref entries, no page tree (0 pages)
    EFTA01220934.pdf  1.1 MB  Missing trailer dictionary, broken xref table

    All three files have valid %PDF- headers but cannot be rendered due to structural corruption. They were likely damaged during the original document production or transfer.


    File Type Verification

    Two levels of verification were performed on all 1,380,939 files (a shell sketch of both checks follows the list):

    1. Magic Byte Detection (file command) — All files contain valid %PDF- headers. 0 mislabeled.
    2. Deep PDF Validation (pdfinfo, poppler 26.02.0) — Parsed xref tables, trailer dictionaries, and page trees. 3 structurally corrupt (Dataset 9 only).
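
    A rough shell sketch of those two passes, assuming file and poppler-utils (pdfinfo) are installed and the dataset sits in a hypothetical DataSet9/ directory; pdfinfo exits non-zero when an xref table or trailer cannot be parsed:

    # Level 1: magic bytes -- report anything that is not actually a PDF
    find DataSet9 -type f -exec file --mime-type {} + | grep -v 'application/pdf$'

    # Level 2: structural validation -- flag files pdfinfo cannot parse
    find DataSet9 -type f -name '*.pdf' -print0 |
      while IFS= read -r -d '' f; do
        pdfinfo "$f" > /dev/null 2>&1 || echo "CORRUPT: $f"
      done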

    Duplicate Analysis

    • Within Datasets: 0 intra-dataset hash duplicates across all 12 datasets.
    • Cross-Dataset: All 1,380,939 SHA256 hashes compared (see the one-liner below). 0 cross-dataset duplicates — every file is unique.
    • Dataset 1 Two Copies: Both copies byte-for-byte identical (SHA256 verified). One is redundant (~2.5 GB).
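
    Assuming the per-dataset SHA256SUMS files are in the format shasum emits (hash, then path), the cross-dataset check reduces to looking for any hash that appears more than once:

    # Prints nothing when every file is unique across all 12 datasets
    cat dataset_*_SHA256SUMS.txt | awk '{print $1}' | sort | uniq -d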

    Integrity Verification

    SHA256 checksums were generated for every file across all 12 datasets. Individual checksum files are available per dataset:

    File                        Hashes    Size
    dataset_1_SHA256SUMS.txt      3,158   256 KB
    dataset_2_SHA256SUMS.txt        574   47 KB
    dataset_3_SHA256SUMS.txt         67   5.4 KB
    dataset_4_SHA256SUMS.txt        152   12 KB
    dataset_5_SHA256SUMS.txt        120   9.7 KB
    dataset_6_SHA256SUMS.txt         13   1.1 KB
    dataset_7_SHA256SUMS.txt         17   1.4 KB
    dataset_8_SHA256SUMS.txt     10,595   859 KB
    dataset_9_SHA256SUMS.txt    531,282   42 MB
    dataset_10_SHA256SUMS.txt   503,154   40 MB
    dataset_11_SHA256SUMS.txt   331,655   26 MB
    dataset_12_SHA256SUMS.txt       152   12 KB

    To compute the SHA256 checksum of any file and compare it against its entry in the corresponding list:

    shasum -a 256 <filename>
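
    To check an entire dataset at once instead of one file at a time, shasum can verify every entry in a SHA256SUMS file; run it from the dataset's root directory (assuming the paths in the list are relative to it):

    shasum -a 256 -c dataset_9_SHA256SUMS.txt | grep -v ': OK$'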
    

    If you’d like access to the SHA256 checksum files or can help host them, send me a DM.


    Methodology

    1. Hash Generation: SHA256 checksums via shasum -a 256 with 8-thread parallel processing (see the example after this list)
    2. OPT Index Comparison: Each dataset’s OPT load file parsed for expected file paths, compared against files on disk
    3. Intra-Dataset Duplicate Detection: SHA256 hashes compared within each dataset
    4. Cross-Dataset Duplicate Detection: All 1,380,939 hashes compared across all 12 datasets
    5. File Type Verification (Level 1): Magic byte detection via file command
    6. Deep PDF Validation (Level 2): Structure validation via pdfinfo (poppler 26.02.0) — xref tables, trailer dictionaries, page trees
    7. Cross-Copy Comparison: Dataset 1’s two copies compared via full SHA256 diff
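
    For reference, step 1 needs nothing more than find, xargs, and shasum; -P 8 gives the 8-way parallelism mentioned above (the directory name is just an example):

    find "DataSet 9" -type f -print0 | xargs -0 -P 8 -n 64 shasum -a 256 > dataset_9_SHA256SUMS.txt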

    Recommendations

    1. Remove Dataset 1 duplicate copy — saves ~2.5 GB
    2. Document the 25 missing Dataset 9 files — community assistance may help locate these
    3. Preserve OPT/DAT index files — authoritative record of expected contents
    4. Distribute SHA256SUMS.txt files — for downstream integrity verification

    Report generated as part of the Epstein Files preservation and verification project.

  • Wild_Cow_5769@lemmy.world · 19 days ago

    Message me at @wild_cow_5769:matrix.org if anyone has a group working on finding the dataset.

    There are billions of people on earth. Someone downloaded dataset 9 before the link was taken down. We just have to find them :)

  • Arthas@lemmy.world · 19 days ago

    Some bad news: it looks like the Data Set 9 zip file link doesn’t work anymore. They appear to have removed the file, so my download stopped at 36 GB. I’m not familiar with their site, so is it normal for them to remove files and maybe put them back at the same link location once they’ve reorganized them? Or do we have to scrape each PDF like another user has been doing?

  • DigitalForensick@lemmy.world · 19 days ago

    While I feel hopeful that we will be able to reconstruct the archive and create some sort of baseline that can be put back out there, I also can’t stop thinking about the “and then what” aspect here. We’ve seen our elected officials do nothing with this info over and over again, and I’m worried this is going to repeat itself.

    I’m fully open to input on this, but I think having a group path forward is useful here. These are the things I believe we can do to move the needle.

    Right Now:

    1. Create a clean Data Archive for each of the known datasets (01-12). Something that is actually organized and accessible.
    2. Create a working Archive Directory containing an “itemized” reference list (SQL DB?) of the full Data Archive, with each document listed as a row with certain metadata. Imagining a GitHub repo that we can all contribute to as we work. Fields: file number; directory location; file type (image, legal record, flight log, email, video, etc.); file status (redacted / missing / flagged booleans). See the sketch after this list.
    3. Infill any MISSING records where possible.
    4. Extract images out of the .pdf format, break out the “multi-file” PDFs, and rename images/docs by file number. (I made a quick script that does this reliably well.)
    5. Determine which files were left as CSAM and “redact” them ourselves, removing any liability on our part.
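
    For item 2, a rough sketch of what that itemized reference list could look like as a single SQLite table (the field names are only a suggestion, not an agreed schema):

    sqlite3 archive_directory.db <<'SQL'
    CREATE TABLE IF NOT EXISTS documents (
        file_number  TEXT PRIMARY KEY,   -- e.g. the EFTA document ID
        dir_location TEXT,               -- path within the clean Data Archive
        file_type    TEXT,               -- image, legal record, flight log, email, video, ...
        redacted     INTEGER DEFAULT 0,  -- boolean flags stored as 0/1
        missing      INTEGER DEFAULT 0,
        flagged      INTEGER DEFAULT 0
    );
    SQL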

    What’s Next: Once we have the Archive and Archive Directory, we can begin safely and confidently walking through the Directory as a group effort and fill in as many files/blanks as possible.

    1. Identify and de-redact all documents with garbage redactions (remember the copy/paste DOJ blunders from December), and identify poorly positioned redaction bars to uncover obfuscated names
    2. LABELING! If we could start adding labels to each document in the form of tags that contain individuals, emails, locations, businesses - This would make it MUCH easier for people to “connect the dots”
    3. Event Timeline… This will be hard, but if we can apply a timeline ID to each document, we can put the archive in order of events
    4. Create some method for visualizing the timeline, searching, or making connections with labels.

    We may not be detectives, legislators, or lawmen, but we are sleuth nerds, and the best thing we can do is get this data into a place that allows others to push for justice and put an end to this crap once and for all. It’s lofty, I know, but enough is enough. …Thoughts?

    • PeoplesElbow@lemmy.world · 19 days ago

      We definitely need a crowdsourced method for going through all the files. I am currently building a solo Cytoscape tool to try out making an affiliation graph. Expanding this into a community tool, with authorization so that only whitelisted individuals can work on it, is beyond my scope, and I can’t volunteer to make such an important tool on my own, but I am happy to offer my help building it. I can convert my existing tool to a prototype if anyone wants to collaborate with me on it. I am an amateur, but I will spend all the Cursor Credits on this.

    • ATroubledMaker@lemmy.world · 18 days ago

      So I know how to do a lot of this and bring something significant insofar as an understanding of both the gravity and volume of things here. Looking through the way everything and anything that has been released has been organized, well, it’s not. This isn’t how an evidence production should ever look.

      There is a way to best organize this and to do so how it would be expected for the presentation of a catalog of digital evidence. I’m aware of this because I’ve done it for years.

      But almost if not maybe even more important is that while there are monsters still hidden in these documents, whether released or still held back, there is something else to consider.

      Those who are involved and know who the monsters are and can never forget them. Ever.

      I took an interest in this specifically because I felt a moral obligation as someone who has been personally affected in this way just not by these specific monsters. However what I do know is the very structure that allows them to roam free, unscathed, even able to sleep at night. What failed to protect those who were harmed also failed me and when I do sleep it is the nightmare that also can never be forgotten.

      This resulted in learning how to spot their fuck ups because I knew what they were and had no reason to trust that it would fix itself. With that said the insight of someone who understands this through unfortunate lived experience provides something that cannot be learned and something I hope others will never be forced to.

      I have msged a few people. One responded. Just trust me when I say that if you are to work collaboratively, have someone who understands the pain you are just going to be reading.

      I will help where it’s needed and it’s needed.

    • Wild_Cow_5769@lemmy.world · 19 days ago

      GFD….

      My 2 cents. As a father of only daughters…

      If we don’t weed out this sick behavior as a society we never will.

      My thoughts are enough is enough.

      Once the files are gone there is little to 0 chance they are ever public again….

      You expect me to believe that an “oh shit, we messed up” was an accident?

      It’s the perfect excuse… so no one looks at the files.

      That’s my 2 cents.

      • DigitalForensick@lemmy.world · 18 days ago

        I’ve been thinking a lot about this whole thing. I don’t want to be worried or fearful here - we have done nothing wrong! Anything we have archived was provided to us directly by them in the first place. There are whispers all over the internet, random torrents being passed around, conspiracies, etc., but what are we actually doing other than freaking ourselves out (myself at least) and going viral with an endless stream of “OMG LOOK AT THIS FILE” videos/posts.

        I vote to remove any of the ‘concerning’ files and backfill with blank placeholder PDFs with justification, then collect everything we have so far, create file hashes, and put out a clean, stable, safely indexed archive of everything we have. That wipes away any concerns, and we can proceed methodically through the trail of documents, resulting in an obvious and accessible collection of evidence. From there we can actually start organizing to create a tool that can be used to crowdsource tagging, timestamping, and parsing the data. I’m a developer and am happy to offer my skillset.
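
        One possible way to generate those placeholder pages, assuming ImageMagick is installed (the document ID and wording here are only examples):

        # Hypothetical placeholder: one blank US-letter page carrying the justification text
        convert -size 612x792 xc:white -pointsize 14 -fill black \
          -annotate +72+396 'EFTA00000000 withheld by archivists: flagged as potential CSAM' \
          EFTA00000000_placeholder.pdf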

        Taking a step back: it’s fun to do the “digital sleuth” thing for a while, but then what? We have the files… (mostly)… Great. We all have our own lives, jobs, and families, and taking actual time to dig into this and produce a real solution that can actually make a difference is a pretty big ask. That said, this feels like a moment where we finally can make an actual difference, and I think it’s worth committing to. If any of you are interested in helping beyond archival, please lmk.

        I just downloaded matrix, but I’m new to this, so I’m not sure how that all works. Happy to link up via discord, matrix, email, or whatever.

      • Wild_Cow_5769@lemmy.world · 19 days ago

        This entire thing smells funny. Even OP went ghost at the threat of suspect images that no one has seen…

        Ask yourself: how did the Times, or whoever came up with this narrative, even find these “suspect” images in a few hours when it seems no one in the world could even download the zip…

        • kutt@lemmy.world · 19 days ago

          A person made a website just to host links and thumbnails for a better interface to the videos on the DoJ website.

          They deleted everything including their account the same day.

          Their farewell message: “Everyone. I know the website is showing all blank. This is unfortunately the end of my little project. Due to certain circumstances, I had to take it down. Thank you everyone for supporting me and my effort.”

          Edit: Link

  • acelee1012@lemmy.world · 19 days ago

    Is anyone else having issues getting dataset 10 to start downloading? It has been sitting at 0 percent for a day while everything else is done and seeding. It shows connections to peers, but rechecking does nothing, deleting and re-adding does nothing, and asking the tracker for more peers does nothing.

      • acelee1012@lemmy.world · 19 days ago

        I am not seeing any errors; it has just been stuck in downloading status with nothing going through. I originally added everything around the same time, and all the other ones went through fine. I figured it was bugged or something, so I removed and re-added it several times to no avail. I am not sure what else to try.

    • Nomad64@lemmy.world · 19 days ago

      I have been seeding all of the datasets since Sunday. The copy of set 9 has been the busiest, with set 10 a distant second. I plan on seeding them for quite a while yet, and also picking up a consolidated torrent when that becomes available. Hopefully you are able to get connected via the Swarm.

  • susadmin@lemmy.world · 22 days ago

    I’m in the process of downloading both dataset 9 torrents (45.63 GB + 86.74 GB). I will then compare the filenames in both versions (the 45.63GB version has 201,358 files alone), note any duplicates, and merge all unique files into one folder. I’ll upload that as a torrent once it’s done so we can get closer to a complete dataset 9 as one file.

    • Kindly_District9380@lemmy.world · 22 days ago

      Superb, I have 1-8, 11-12.

      Only 10 remains to complete (downloading it from Archive.org now).

      Dataset 9 is the biggest. I ended up writing a parser to go through every page on justice.gov and make an index list.

      Current estimate of files list is:

      • ~1,022,500 files (50 files/page × 20,450 pages)
      • My scraped index so far: 528,586 files / 634,573 URLs
      • Currently downloading individual files: 24,371 files (29GB)
      • Download rate ~1 file/sec to avoid getting blocked = ~12 days continuous for the full set (see the sketch below)

      Your merged 45GB + 86GB torrents (~500K-700K files) would be a huge help. Happy to cross-reference with my scraped URL list to find any gaps.
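
      The ~1 file/sec crawl mentioned above is essentially just a polite loop over the scraped URL list. A minimal sketch, assuming one URL per line in a hypothetical scraped_urls.txt:

      while IFS= read -r url; do
        curl -sfOJ --retry 3 --retry-delay 10 "$url"   # -O -J: save under the server-supplied filename
        sleep 1                                        # ~1 request/sec to avoid getting blocked
      done < scraped_urls.txt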

        • GorillaCall@lemmy.world · 19 days ago

        Anyone have the original 186 GB magnet link from that thread? Someone said reddit keeps nuking it because it implicates reddit admins like spez.

    • helpingidiot@lemmy.world · 22 days ago

      Have a good night. I’ll be waiting to download it, seed it, make hardcopies and redistribute it.

      Please check back in with us

    • epstein_files_guy@lemmy.world · 22 days ago

      looking forward to your torrent, will seed.

      I have several incomplete sets of files from dataset 9 that I downloaded with a scraped set of urls - should I try to get them to you to compare as well?

      • susadmin@lemmy.world · 22 days ago

        Yes! I’m not sure the best way to do that - upload them to MEGA and message me a download link?

        • epstein_files_guy@lemmy.world · 22 days ago

          Maybe archive.org? That way they can be torrented if others want to attempt their own merging techniques. Either way it will be a long upload; my speed is not especially good. I’m still churning through one set of URLs that is 1.2M lines; most are failing, but I have 65k from that batch so far.

    • xodoh74984@lemmy.world (OP) · 22 days ago

      When merging versions of Data Set 9, is there any risk of loss with simply using rsync --checksum to dump all files into one directory and merge the sets?

      • susadmin@lemmy.world · 22 days ago

        rsync --checksum is better than my file name + file size comparison, since it compares file contents by hash rather than just name and size. For example, if there is a file called data1.pdf with size 1024 bytes in dataset9-v1, and another file called data1.pdf with size 1024 bytes in dataset9-v2, but their content is different, my method would still treat them as identical files.

        I’m going to modify my script to calculate and compare the hashes of all files that I previously determined to be duplicates. If the hashes of the duplicates in dataset9 (45GB torrent) match the hashes of the duplicates in dataset9 (86GB torrent), then they are in fact duplicates between the two datasets.
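
        A minimal shell sketch of that comparison, assuming the two copies sit in hypothetical dataset9-45g/ and dataset9-86g/ directories and that the file names contain no spaces: hash both trees, then list any path that appears in both with different content, which is exactly the case a name + size check would miss.

        (cd dataset9-45g && find . -type f -print0 | xargs -0 -P 8 shasum -a 256) | sort -k2 > v1.sha256
        (cd dataset9-86g && find . -type f -print0 | xargs -0 -P 8 shasum -a 256) | sort -k2 > v2.sha256
        # Paths present in both trees whose hashes differ (true conflicts, not duplicates)
        join -j 2 v1.sha256 v2.sha256 | awk '$2 != $3 {print $1}'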

        • xodoh74984@lemmy.world (OP) · 22 days ago

          Amazing, thank you. That was my thought: check hashes while merging the files so we keep any copies that might have been modified by the DOJ, and discard duplicates even if they have different metadata, e.g. timestamps.

    • thetrekkersparky@startrek.website · 22 days ago

      I’m downloading 8-11 and seeding 1-7 and 12 now. I’ve tried checking up on reddit, but every other time I check in, the post is nuked or something. My home server never goes down, and I’m outside the USA. I’m working on the 100 GB+ #9 right now and I’ll seed whatever you can get up here too.

  • TavernerAqua@lemmy.world · 19 days ago

    In regard to Dataset 9, it’s currently being shared on Dread (forum).

    I have no idea if it’s legit or not, and Idc to find out after reading about what’s in it from NYT.

  • shithawk@lemmy.world · 22 days ago

    I’ve hopped on the 10 mag, will be seeding all night and then some. This might be one of the healthiest swarms I’ve ever seen

  • hYcG68caGB7WvLX67@lemmy.world · 22 days ago

    I was quick to download dataset 12 after it was discovered to exist, and apparently my dataset 12 contains some files that were later removed. Uploaded to IA in case it contains anything that later archivists missed. https://archive.org/details/data-set-12_202602

    Specifically, doc number 2731361 and others around it were at some point later removed from the DOJ site, but they are still within this early-download DS12. There may be more; I’m unsure.

    • susadmin@lemmy.world · 22 days ago

      The files in this (early) dataset 12 are identical to the dataset 12 here, which is the link in the OP. The MD5 hashes are identical.

      I shared a .csv file of the calculated MD5 hashes here

  • WhatCD@lemmy.world · 22 days ago

    I’m working on a different method of obtaining a complete dataset zip for dataset 9. For those who are unaware, for a time yesterday there was an official zip available from the DOJ. To my knowledge no one was able to fully grab it, but I believe the 49 GB zip is a partial copy of it from before downloads got cut. It’s my thought that this original zip likely contained incriminating information, and that’s why it got halted.

    What I’ve observed is that Akamai still serves that zip sporadically in small chunks. It’s really strange and I’m not sure why it does, but I have verified with strings that there are pdf file names in the zip data. I’ve been able to use a script to pull small chunks from the CDN across the entire span of the file’s byte range.
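
    For anyone curious what the script is doing under the hood, a single ranged request can be reproduced with curl; the byte range below (a 16 MB window starting right where the ~48 GB torrent copy cuts off) and the output file name are just examples:

    curl -s 'https://www.justice.gov/epstein/files/DataSet%209.zip' \
      -H 'Range: bytes=48995762176-49012539391' \
      -e 'https://www.justice.gov/age-verify?destination=%2Fepstein%2Ffiles%2FDataSet+9.zip' \
      -b cookies.txt \
      -o 'DataSet 9.zip.chunks/48995762176-49012539391.bin'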

    Using the 49GB file as a starting point I’m working on piecing the file together, however progress is extremely extremely slow. If there is anyone willing to team up on this and combine the chunks please let me know.

    How to grab the chunked data:

    Script link: https://pastebin.com/ZqZGqkiH

    or

    Script with progress visualization: https://pastebin.com/UiCeTe3p

    For the script with the progress visualization you will probably have to:

    pip install rich
    

    Grab DATASET 9, INCOMPLETE AT ~48GB:

     magnet:?xt=urn:btih:0a3d4b84a77bd982c9c2761f40944402b94f9c64&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce 
    

    Then name the downloaded file 0-(the last byte the file spans).bin

    So for example the 48 GB file it would be: 0-48995762175.bin

    Next to the python script make a directory called: DataSet 9.zip.chunks

    Move the renamed first byte range 48 GB file in to that directory.

    Make a new file next to the script called cookies.txt

    Install the cookie editor browser extension (https://cookie-editor.com/)

    With the browser extension open go to: https://www.justice.gov/age-verify?destination=%2Fepstein%2Ffiles%2FDataSet+9.zip

    The download should start in your browser, cancel it.

    Export the cookies in Netscape Format. They will copy to your clipboard.

    Paste those in your cookies.txt, save and close it.
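
    The prep steps above, condensed into shell (assuming the ~48 GB torrent download is saved as DataSet9-partial.zip; adjust to whatever your client named it):

    mkdir -p 'DataSet 9.zip.chunks'
    # The partial file spans bytes 0 through 48995762175, hence the chunk name
    mv DataSet9-partial.zip 'DataSet 9.zip.chunks/0-48995762175.bin'
    touch cookies.txt   # paste the Netscape-format cookie export in here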

    You can run the script like so:

    python3 script.py \
      'https://www.justice.gov/epstein/files/DataSet%209.zip' \
      -o 'DataSet 9.zip' \
      --cookies cookies.txt --retries 3 \
      --backoff 5.0 \
      --referer 'https://www.justice.gov/age-verify?destination=%2Fepstein%2Ffiles%2FDataSet+9.zip' \
      -t 8 -c 2048
    

    Script Options:

    • -t - The number of concurrent threads to use which results in trying that many byte ranges at the same time.
    • -c - The chunk size to request from the server in MB. This is not always respected by the server and you may get a smaller or larger chunk, but the script should handle that.
    • --backoff - The backoff factor between failures, which helps prevent Akamai from throttling your requests.
    • --retries - The number of times to retry a byte range for that iteration before moving on to the next byte range. If it moves on it will come back to it again on the next loop.
    • --cookies - The path to the file containing your Netscape formatted cookies.
    • -o - The final file name. The chunks directory is derived from this so make sure it matches the name of the chunk directory that you primed with the torrent chunk.
    • --referer - Just leave this as-is for Akamai; it sets the Referer HTTP header.

    There are more options if you run the script with the --help option.

    If you start to receive HTML and or HTTP/200 responses then you need to refresh your cookie.

    If you start to receive HTTP/400 responses then you need to refresh your cookie in a different browser, Akamai is very fussy.

    A VPN and multiple browsers might be useful for changing your cookie and location combo.

      • WhatCD@lemmy.world · 21 days ago

        I would be interested in obtaining the chunks that you gathered and stitching them together with what I gathered.

      • WhatCD@lemmy.world · 21 days ago

        Yeah :/ I hadn’t been able to pull anything in a while, but I was just able to pull 6 chunks. The data is still out there!

          • WhatCD@lemmy.world · 21 days ago

            What happens when you go to https://www.justice.gov/epstein/files/DataSet%209.zip in your browser?

              • WorldlyBasis9838@lemmy.world · 21 days ago

                Can also confirm, receiving more chunks again.

                EDIT: Someone should play around with the retry and backoff settings to see if a certain configuration can avoid being blocked for a longer period of time. IP rotating is too much trouble.

                • WhatCD@lemmy.world · 21 days ago

                  Updated the script to display information better: https://pastebin.com/S4gvw9q1

                  It has one library dependency so you’ll have to do:

                  pip install rich
                  

                  I haven’t been getting blocked with this:

                  python script.py 'https://www.justice.gov/epstein/files/DataSet%209.zip' -o 'DataSet 9.zip' --cookies cookie.txt --retries 2 --referer 'https://www.justice.gov/age-verify?destination=%2Fepstein%2Ffiles%2FDataSet+9.zip' --ua '<set-this>' --timeout 90 -t 16 -c auto
                  

                  The new script can auto set threads and chunks, I updated the main comment with more info about those.

                  I’m setting the --ua option, which lets you override the User-Agent header. I’m making sure it matches the browser that I use to request the cookie.

              • WhatCD@lemmy.world · 21 days ago

                Yeah when I run into this I’ve switched browsers and it’s helped. I’ve also switched IP addresses and it’s helped.

    • epstein_files_guy@lemmy.world · 22 days ago

      I’m using a partial download I already had, not the 48 GB version, but I will be gathering as many chunks as I can as well. Thanks for making this.

  • jankscripts@lemmy.world · 22 days ago

    Heads up that the DOJ site is a tar pit: it returns 50 files on the page regardless of the page number you’re on. It seems like somewhere between 2k and 5k pages it just wraps around right now.

    Testing page 2000... ✓ 50 new files (out of 50)
    Testing page 5000... ○ 0 new files - all duplicates
    Testing page 10000... ○ 0 new files - all duplicates
    Testing page 20000... ○ 0 new files - all duplicates
    Testing page 50000... ○ 0 new files - all duplicates
    Testing page 100000... ○ 0 new files - all duplicates
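
    A rough sketch of that probe for anyone who wants to reproduce it, assuming the EFTA document IDs appear in the listing HTML and that seen_ids.txt already holds the sorted IDs collected so far:

    base='https://www.justice.gov/epstein/doj-disclosures/data-set-9-files'
    for p in 2000 5000 10000 20000 50000 100000; do
      curl -s "$base?page=$p" | grep -oE 'EFTA[0-9]+' | sort -u > page_ids.txt
      echo "page $p: $(comm -23 page_ids.txt seen_ids.txt | wc -l) new IDs (of $(wc -l < page_ids.txt))"
    done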

    • WorldlyBasis9838@lemmy.world · 21 days ago

      I saw this too; yesterday I tried manually accessing the page to explore just how many there are. Seems like some of the pages are duplicates (I was simply comparing the last listed file name and content between some of the first 10 pages, and even had 1-2 duplications.)

      As far as the maximum page number goes, if you use the query parameter ?page=200000000 it will still resolve a list of files — actually crazy.

      https://www.justice.gov/epstein/doj-disclosures/data-set-9-files?page=200000000

    • jankscripts@lemmy.world · 21 days ago

      The last page I got a non-duplicate URL from was 10853, which curiously only had 36 URLs on the page. When I browsed directly to page 10853, 36 URLs were displayed, but then, moving back and forth in the page count, the tar-pit logic must have re-looped there and it went back to 50 displayed. I ended up with 224,751 URLs.

  • Arthas@lemmy.world · 8 days ago

    for DS9, does anyone have the following files:

      EFTA00709804
      EFTA00709805
      EFTA00709806
      EFTA00709807
      EFTA00770595
      EFTA00774768
      EFTA00823190
      EFTA00823191
      EFTA00823192
      EFTA00823221
      EFTA00823319
      EFTA00877475
      EFTA00892252
      EFTA00901740
      EFTA00912980
      EFTA00919433
      EFTA00919434
      EFTA00932520
      EFTA00932521
      EFTA00932522
      EFTA00932523
      EFTA00984666
      EFTA00984668
      EFTA01135215
      EFTA01135708
    

    If so, please DM them to me and I can include them in my master archive.