I see a lot of fragmented datasets out there. Does anyone know of something comprehensive (e.g. all files from all datasets), where someone is annotating the files and accepting submissions?

  • TropicalDingdong@lemmy.world · 17 days ago

    There is an active discussion on it here:

    https://github.com/yung-megafone/Epstein-Files

    Also, at least @ermstein@lemmy.world seems to be contributing:

    https://github.com/yung-megafone/Epstein-Files/issues/4

    I’ve got at least 1–9 right now, but it’s messier and harder to get than you might imagine. I would say we’re close to, but not all of the way toward, an exhaustive list, and we know through the work of Congress that there are millions of files not included.

          • TropicalDingdong@lemmy.world · 16 days ago

            Archives not unzipping, corrupt files, all kinds of problems. I’ve had to redownload many times and tried several magnets. Extraction issues throughout.

            It’s just a shit ton of data. 9 is still in a haphazard state.

                • untitled_backer@lemmy.ml (OP) · 15 days ago

                  … So you did. I guess I was responding to your comment and forgot to read the thread. That’s embarrassing. Not sure why you had problems. Are you still having trouble?

                  • TropicalDingdong@lemmy.world · 15 days ago

                    I got a version of 9 done. I think it’s the best up-to-date version, but I’m not sure yet. This is… it’s an extraordinary amount of data.

                    I’ve got a 42 TB NAS and a processing machine that can run machine-learning models locally with up to 128 GB of VRAM.

                    I’m trying to use Datashare (https://datashare.icij.org/) to organize/index the records, but it’s been a bit of a disaster. I’m constantly having to restart/rebuild the Docker container because it gets into a bad/hung state.

                    I had originally planned to just develop a Postgres database to index/support analysis, but thought Datashare could simplify this. It’s not been good. It hangs when indexing documents. Also, none of the plugins seem to really work, but I appreciate it as an open-source concept.

                    I suppose I could just be hitting the .justice files directly… but that seems problematic for several other reasons. First, document integrity: I don’t trust them. Second, tracking: they’ll almost certainly be able to reverse-engineer a list of who is examining this data.

                    All in all, I have the gear to do this homelab style, I have the analysis expertise, and even though I shouldn’t, I can put time into doing aspects of this.

                    I still need to follow up on datasets 11 and 12, but I’ve got 1-10 extracted and indexed.

                    I could use help though, if for nothing else, to have a conversation partner. Right now I’ve been focused almost exclusively on getting the data onto the stacks and figuring out an indexing solution. Beyond that, I’ve poked around the .justice site and, while listening to podcasts, have pursued some keywords. But those don’t compose a coherent analytical framework.

                    I did this a while back: https://codeberg.org/sillyhonu/Image_OCR_Processing_Epstein

                    Now, these have been OCR’d already, but honestly, it’s kinda shit, and there are some real gaps in these data. In the codeberg example, I did that with rented compute. If I use my own machine, I can push MUCH harder.

                    My first thought is to try and collate entities, emails, phone numbers, IPs, and addresses, then, separately, dates and times. Right now one of the most difficult challenges with these data is the inability to sort by time. I’d like to address that.
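                    A minimal sketch of what that collation pass could look like, assuming plain-text OCR output. The patterns and the extract_entities helper here are my own illustrative starting points, not anything from the repo, and real OCR noise will need looser variants plus post-validation:

```python
import re

# Illustrative patterns only -- real OCR output will need fuzzier versions.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"(?:\+?\d{1,2}[\s.-])?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
    "ipv4":  re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "date":  re.compile(r"\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})\b"),
}

def extract_entities(text: str) -> dict[str, list[str]]:
    """Run every pattern over one OCR'd page and bucket the raw hits."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}
```

                    Each bucket could then land in its own table, keyed by source document, so the date hits give you something to sort on.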

                    My thinking is to build out a Postgres DB of these. This is going to require some fuzzy matching for partial reads. We can assume ALL OCR is going to fail to some degree.
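                    For the partial-read problem, the usual move is a similarity threshold over normalized strings. A rough stdlib-only sketch (the canonicalize helper and the 0.85 threshold are placeholder choices of mine; inside Postgres itself, the pg_trgm extension's similarity() function plays the same role and can be indexed):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; tolerant of OCR substitutions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def canonicalize(values: list[str], threshold: float = 0.85) -> dict[str, str]:
    """Map each noisy OCR reading to the first canonical value it matches.
    Greedy single-pass clustering -- crude, but enough to merge variants."""
    canon: list[str] = []
    mapping: dict[str, str] = {}
    for v in values:
        for c in canon:
            if similarity(v, c) >= threshold:
                mapping[v] = c
                break
        else:
            canon.append(v)
            mapping[v] = v
    return mapping
```

                    The threshold would need tuning per entity type (emails tolerate far less slack than names).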

                    Another reason to take a fuzzy-matching approach would be to try and in-fill/de-anonymize the redactions. There are enough flaws and faults in the manner of the redactions that, when you get certain sets of documents, you can effectively infer and fill in what should go in the gaps.

                    Anyways. What would be extremely helpful would be to have some conversations on how to approach this.