I see a lot of fragmented datasets out there. Does anyone know of something comprehensive (e.g. all files from all datasets), where someone is annotating the files and accepting submissions?

  • TropicalDingdong@lemmy.world · 17 days ago

    There is an active discussion on it here:

    https://github.com/yung-megafone/Epstein-Files

    Also, at least @ermstein@lemmy.world seems to be contributing:

    https://github.com/yung-megafone/Epstein-Files/issues/4

    I’ve got at least 1–9 right now, but it’s messier and harder to get than you might imagine. I would say we’re close to, but not all of the way toward, an exhaustive list, and we know through the work of Congress that there are millions of files not included.

          • TropicalDingdong@lemmy.world · 16 days ago

            Archives not unzipping, corrupt files, all kinds of problems. I’ve had to redownload many times and tried several magnets. Extraction issues throughout.

            It’s just a shit ton of data. 9 is still in a haphazard state.

                • untitled_backer@lemmy.ml (OP) · 15 days ago

                  … So you did. I guess I was responding to your comment and forgot to read the thread. That’s embarrassing. Not sure why you had problems. Are you still having trouble?

                  • TropicalDingdong@lemmy.world · 15 days ago

                    I got a version of 9 done. I think it’s the best up-to-date version, but I’m not sure yet. This is… it’s an extraordinary amount of data.

                    I’ve got a 42 TB NAS and a processing machine that can run machine-learning models locally with up to 128 GB of VRAM.

                    I’m trying to use Datashare (https://datashare.icij.org/) to organize/index the records, but it’s been a bit of a disaster. I’m constantly having to restart/rebuild the Docker container because it gets into a bad/hung state.

                    I had originally planned to just develop a Postgres database to index/support analysis, but thought Datashare could simplify this. It’s not been good. It hangs when indexing documents. Also, none of the plugins seem to really work, but I appreciate it as an open-source concept.

                    I suppose I could just be hitting the .justice files directly… but that seems problematic for several other reasons. First, document integrity: I don’t trust them. Second, tracking: they’ll almost certainly be able to reverse-engineer a list of who is examining this data.

                    All in all, I have the gear to do this homelab style, I have the analysis expertise, and even though I shouldn’t, I can put time into doing aspects of this.

                    I still need to follow up on datasets 11 and 12, but I’ve got 1-10 extracted and indexed.

                    I could use help though, if for nothing else, to have a conversation partner. Right now I’ve been focused almost exclusively on getting the data onto the stacks and figuring out an indexing solution. Beyond that, I’ve poked around the .justice site and, while listening to podcasts, have pursued some keywords. But those don’t compose a coherent analytical framework.

                    I did this a while back: https://codeberg.org/sillyhonu/Image_OCR_Processing_Epstein

                    Now, these have been OCR’d already, but honestly, it’s kinda shit, and there are some real gaps in these data. In the codeberg example, I did that with rented compute. If I use my own machine, I can push MUCH harder.

                    My first thought is to try and collate entities, emails, phone numbers, IPs, and addresses, then, separately, dates and times. Right now one of the most difficult challenges with these data is the inability to sort by time. I’d like to address that.
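                    A minimal sketch of what that collation pass could look like, assuming plain-text OCR output. The patterns and the extract_entities helper here are my own illustrative starting points, not anything from the repo, and real OCR noise will need looser variants plus post-validation:

```python
import re

# Illustrative patterns only -- real OCR output will need fuzzier versions.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"(?:\+?\d{1,2}[\s.-])?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
    "ipv4":  re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "date":  re.compile(r"\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})\b"),
}

def extract_entities(text: str) -> dict[str, list[str]]:
    """Run every pattern over one OCR'd page and bucket the raw hits."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}
```

                    Each bucket could then land in its own table, keyed by source document, so the date hits give you something to sort on.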

                    My thinking is to build out a Postgres DB of these. This is going to require some fuzzy matching for partial reads. We can assume ALL OCR is going to fail to some degree.
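                    For the partial-read problem, the usual move is a similarity threshold over normalized strings. A rough stdlib-only sketch (the canonicalize helper and the 0.85 threshold are placeholder choices of mine; inside Postgres itself, the pg_trgm extension's similarity() function plays the same role and can be indexed):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; tolerant of OCR substitutions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def canonicalize(values: list[str], threshold: float = 0.85) -> dict[str, str]:
    """Map each noisy OCR reading to the first canonical value it matches.
    Greedy single-pass clustering -- crude, but enough to merge variants."""
    canon: list[str] = []
    mapping: dict[str, str] = {}
    for v in values:
        for c in canon:
            if similarity(v, c) >= threshold:
                mapping[v] = c
                break
        else:
            canon.append(v)
            mapping[v] = v
    return mapping
```

                    The threshold would need tuning per entity type (emails tolerate far less slack than names).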

                    Another reason to take a fuzzy-matching approach would be to try and in-fill/de-anonymize the redactions. There are enough flaws and faults in the manner of the redactions that, when you get certain sets of documents, you can effectively infer and fill in what should go in the gaps.

                    Anyways. What would be extremely helpful would be to have some conversations on how to approach this.