I see a lot of fragmented datasets out there. Does anyone know of something comprehensive (e.g. all files from all datasets) where someone is annotating the files and accepting submissions?

  • untitled_backer@lemmy.mlOP · 8 days ago

    … So you did. I guess I was responding to your comment and forgot to read the thread. That’s embarrassing. Not sure why you had problems. Are you still having trouble?

    • TropicalDingdong@lemmy.world · 8 days ago

      I got a version of 9 done. I think it’s the most up-to-date version, but I’m not sure yet. This is… it’s an extraordinary amount of data.

      I’ve got a 42 TB NAS and a processing machine that can run machine-learning models with up to 128 GB of VRAM locally.

      I’m trying to use Datashare (https://datashare.icij.org/) to organize/index the records, but it’s been a bit of a disaster. I’m constantly having to restart/rebuild the Docker container because it gets into a bad/hung state.

      I had originally planned to just develop a Postgres database to index and support analysis, but thought Datashare could simplify this. It’s not been good. It hangs when indexing documents. Also, none of the plug-ins seem to really work, but I appreciate it as an open-source concept.
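
      If it keeps wedging, one stopgap (not a fix for the underlying hang) is a tiny watchdog that restarts the container whenever the UI stops answering. Rough sketch only; the container name "datashare" and port 8080 are assumptions, so swap in whatever your compose setup actually uses:

      ```python
      """Restart the Datashare container when its web UI stops responding.

      Rough watchdog sketch; the container name and port are assumptions.
      """
      import subprocess
      import time
      import urllib.request

      CONTAINER = "datashare"                # assumed container name
      HEALTH_URL = "http://localhost:8080/"  # assumed Datashare UI port
      CHECK_EVERY = 300                      # seconds between checks

      def ui_responds(timeout=15):
          """Return True if the web UI answers an HTTP request at all."""
          try:
              urllib.request.urlopen(HEALTH_URL, timeout=timeout)
              return True
          except Exception:
              return False

      while True:
          if not ui_responds():
              # `docker restart` stops and restarts the container in place
              subprocess.run(["docker", "restart", CONTAINER], check=False)
          time.sleep(CHECK_EVERY)
      ```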

      I suppose I could just be hitting the .justice files directly… but that seems problematic for several other reasons. First, document integrity: I don’t trust them. Second, tracking: they’ll almost certainly be able to reverse engineer a list of who is examining this data.

      All in all, I have the gear to do this homelab-style, I have the analysis expertise, and even though I shouldn’t, I can put time into doing aspects of this.

      I still need to follow up on datasets 11 and 12, but I’ve got 1-10 extracted and indexed.

      I could use help though, if nothing else, to have a conversation partner. Right now I’ve been focused almost exclusively on getting the data onto the stacks and figuring out an indexing solution. Beyond that I’ve poked around the .justice site and, while listening to podcasts, have pursued some keywords. But those don’t amount to a coherent analytical framework.

      I did this a while back: https://codeberg.org/sillyhonu/Image_OCR_Processing_Epstein

      Now these have been OCR’d already, but honestly it’s kinda shit, and there are some real gaps in these data. In the codeberg example, I did that with rented compute. If I use my machine, I can push MUCH harder.
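
      A local re-OCR pass along those lines is pretty short with pdf2image + pytesseract (assumes the Tesseract and Poppler binaries are installed; the paths are placeholders):

      ```python
      """Re-OCR a directory of PDFs locally with Tesseract.

      Minimal sketch; requires the tesseract and poppler binaries plus the
      pdf2image and pytesseract packages. Paths are placeholders.
      """
      from pathlib import Path

      import pytesseract
      from pdf2image import convert_from_path

      SRC = Path("/mnt/nas/raw_pdfs")   # placeholder input dir
      DST = Path("/mnt/nas/ocr_text")   # placeholder output dir
      DST.mkdir(parents=True, exist_ok=True)

      for pdf in sorted(SRC.glob("*.pdf")):
          # Render each page to an image, then OCR it; 300 dpi is a common
          # accuracy/speed tradeoff.
          pages = convert_from_path(str(pdf), dpi=300)
          text = "\n\f\n".join(pytesseract.image_to_string(p) for p in pages)
          (DST / f"{pdf.stem}.txt").write_text(text, encoding="utf-8")
      ```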

      My first thought is to try to collate entities, emails, phone numbers, IPs, and addresses. Then, separately, dates and times. Right now one of the most difficult challenges with these data is the inability to sort by time. I’d like to address that.
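
      For that first collation pass, a plain regex sweep plus dateutil for normalizing dates would probably get the time-sorting problem unblocked. Sketch only; the patterns are deliberately loose and will need tuning against real OCR noise:

      ```python
      """First-pass extraction of emails, phones, IPs, and dates from OCR text."""
      import re
      from dateutil import parser as dateparser  # pip install python-dateutil

      EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
      PHONE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
      IPV4  = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
      # Very rough date candidates: 01/02/2003, 2003-01-02, Jan 2, 2003, ...
      DATEISH = re.compile(
          r"\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
          r"|\d{4}-\d{2}-\d{2}"
          r"|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2},? \d{4})\b"
      )

      def extract(text: str) -> dict:
          """Pull candidate entities out of one document's OCR text."""
          dates = []
          for m in DATEISH.findall(text):
              try:
                  dates.append(dateparser.parse(m, fuzzy=True))
              except (ValueError, OverflowError):
                  pass  # OCR noise that only looked like a date
          return {
              "emails": sorted(set(EMAIL.findall(text))),
              "phones": sorted(set(PHONE.findall(text))),
              "ips": sorted(set(IPV4.findall(text))),
              "dates": sorted(set(dates)),  # these make the corpus sortable by time
          }
      ```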

      My thinking is to build out a Postgres DB of these. This is going to require some fuzzy matching for partial reads. We can assume ALL OCR is going to fail to some degree.
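
      On the Postgres side, the pg_trgm extension gives you trigram similarity directly in SQL, which covers a lot of the partial-read matching in the database itself. Sketch with psycopg2; the DSN, table/column names, and the 0.6 threshold are all placeholders to tune:

      ```python
      """Fuzzy entity matching in Postgres via the pg_trgm extension."""
      import psycopg2

      conn = psycopg2.connect("dbname=docs user=analyst")  # placeholder DSN
      cur = conn.cursor()

      # One-time setup: trigram extension plus a GIN index so similarity
      # lookups stay fast once there are millions of extracted entities.
      cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")
      cur.execute("""
          CREATE TABLE IF NOT EXISTS entities (
              id        bigserial PRIMARY KEY,
              doc_id    text NOT NULL,
              kind      text NOT NULL,   -- 'person', 'email', 'phone', 'ip', ...
              raw_value text NOT NULL    -- as read by OCR, errors and all
          );
      """)
      cur.execute(
          "CREATE INDEX IF NOT EXISTS entities_value_trgm "
          "ON entities USING gin (raw_value gin_trgm_ops);"
      )
      conn.commit()

      def similar_entities(value: str, threshold: float = 0.6):
          """Return stored values whose trigram similarity to `value` clears the threshold."""
          cur.execute(
              """
              SELECT raw_value, similarity(raw_value, %s) AS score
              FROM entities
              WHERE similarity(raw_value, %s) > %s
              ORDER BY score DESC;
              """,
              (value, value, threshold),
          )
          return cur.fetchall()
      ```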

      Another reason to take a fuzzy-matching approach would be to try to in-fill/de-anonymize the redactions. There are enough flaws and faults in the manner of the redactions that, when you get some sets of documents, you can effectively infer and fill in what should go in the gaps.
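
      One simple way to frame the in-fill problem: filter a candidate entity list by the length of the redacted span, then score the survivors against the unredacted context around it. Purely illustrative; the candidate list and the scoring would come from your own collated entities, and rapidfuzz is just one option for the scoring step:

      ```python
      """Rank candidate fills for a redacted span by length fit and context reuse.

      Illustrative only: `candidates` would come from the collated entity table,
      and the scoring here is deliberately simple.
      """
      from rapidfuzz import fuzz  # pip install rapidfuzz

      def rank_candidates(redaction_len, context, candidates, slack=2):
          """Keep candidates whose length fits the redacted span, then score them
          by fuzzy similarity against the surrounding unredacted context."""
          fits = [c for c in candidates if abs(len(c) - redaction_len) <= slack]
          scored = [(c, fuzz.partial_ratio(c.lower(), context.lower())) for c in fits]
          return sorted(scored, key=lambda pair: pair[1], reverse=True)

      # Example: a ~12-character redaction in a paragraph about a flight manifest.
      print(rank_candidates(12, "flight manifest lists J. Doe departing Teterboro",
                            ["John Example", "Jane Q. Doe", "Acme Holdings LLC"]))
      ```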

      Anyways. What would be extremely helpful would be to have some conversations on how to approach this.