Magazines, no. But I have been collecting my party’s newspaper, “Szabadság” (Freedom), since I joined them. I wish its articles could be shared, but sadly translating them would need someone doing it as a full-time job.
Same, I collect the Party’s newspaper, El Popular. In contrast to yours, the paper is fully online and free so I sometimes share articles in !rincon_hispano@lemmygrad.ml. It’s in Spanish though.
OCR a clear picture, then run it through AI - some models are better depending on the language; I’m not sure how well they handle Hungarian. For OCR I recently started using MinerU, an all-in-one Python package that OCRs with a local model and preserves formatting and layout. I just installed it through pip; if it runs on the GPU it’s pretty efficient.
You may need to refine a prompt for the translation itself, it takes some trial and error to get something you’re happy with enough to lock in.
After you have the first translation pass from the LLM, you can refine it further, either by editing it yourself or by passing it through more prompts. I found that deepseek is better at editing than it used to be, and I’m sure other models fare well too. This is an editing prompt I used recently:
This text was translated by an LLM and the result is good, but reads very artificial. It was translated word-for-word and doesn’t sound very natural. Can you do a pass over the text to bring it to the next level? Do not change the ideas that the text conveys - the words have been picked carefully. However, you are allowed to reorder sentences and syntax to really make this text look like it was originally written in English. This is not a translation task anymore but localization.
(And then tell it to generate with the ‘three backticks’ so you can easily copy the output, the model knows what this means but just needs a push to do it I’ve noticed).
If there’s a lot of text, you can even get AI to build you a pipeline of Python scripts to handle batch processing. Then you’d only have to take the pictures of the newspaper. You can also have the scripts translate into any number of languages.
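To give an idea of what one piece of such a pipeline could look like, here is a minimal sketch of a chunking step: newspapers and books are too long for one prompt, so a script would cut the OCR’d text into paragraph-respecting chunks before sending each to the LLM. The function name and the 4000-character limit are my own assumptions for illustration, not any particular tool’s API.

```python
# Hypothetical sketch: split OCR'd text into chunks small enough for an
# LLM prompt, without cutting paragraphs in half.
def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # start a new chunk if adding this paragraph would overflow it
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# each chunk would then be sent to the LLM with your translation prompt
```

Each chunk goes out with the same prompt, and the translated chunks get stitched back together at the end.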
Or, I just realized: if they’re okay with it, they can also just hand you the raw text files 🤷‍♂️ (but I still recommend MinerU for OCR lol)
But AI is really bad for the environment, and is it even that good for translation?
I have to start with some pushback: I know you don’t mean it that way, but these types of comments are dismissive of all the work comrades put into using and developing AI models to solve actual problems, such as making more theory/marxist writing available (especially pertaining to local conditions). Every time a comrade talks about how they use AI to solve problems for the movement, there’s always one person that chimes in trying to rain on their parade, and I told myself some time ago that I would start calling it out when I see it.
The environmental impact is no worse than anything else you consume and use under capitalism already. It’s difficult to properly gauge the environmental impact of anything, especially when the service is digital and new like AI is. But there are much more pressing industries to tackle if we want to do something about the environment, and as communists we know that individual responsibility is a myth invented by fossil fuel companies. As soon as the word AI is mentioned, some people retreat to liberal mythology: that there is suddenly such a thing as ethical consumption, that change at the local level leads to change globally, and that if people just had the right ideas and habits we would fix the world.
Us refusing to use AI for our own needs will not suddenly make it disappear. I wrote an essay about this; it will continue doing what it does regardless of how much we posture against it, because posturing does not produce material change. Should the bourgeoisie have the monopoly on this new technology, or should we get in there and develop and use it for our own goals?
As for translation work, I have used it in languages I speak and it works well enough. Otherwise I wouldn’t recommend it obviously. It’s not going to be as good as a professional translator would get, but:
- a professional translator working at a level where the result sounds native will take weeks to translate even an essay. AI takes a couple of minutes of waiting.
- we are not all professional translators. Even speaking the language natively I don’t think I could do a better job than the AI.
- hiring one is expensive. The cost is prohibitive for most people and only makes sense commercially.
- it frees up time to do other things.
- more importantly, it’s the difference between having something produced and not having anything.
Prior to LLMs I tried for 3 years to crowdsource a book translation. Our grand total of pages translated was 0 for a team of 12 at its peak. One night two of us took it to an old version of chatGPT and got it done in 2 hours.
This is a force multiplier for a movement that is always short on time and people.
Alright! Tysm! I am not a tech savvy person, so I will need a lot of time to digest and integrate such a system. I will see what I can do!
Thankfully, most newspapers get uploaded to their sites in .pdf form, so I think I can just copy the text from there. It was a good idea that I could use AI to help me out with such tasks and then just make sure the result is good enough to be shared here.
Great, I will look into it! I wonder if Comrade Rainpizza is using a similar method to this when sharing KPRF texts in the Russia community.
Regardless of that, this could be done the other way around too (I am just thinking out loud): translating interesting articles you guys share on here and bringing them to my Hungarian comrades’ attention.
Although, since I am mostly just a computer user, I’d default to Deepseek or Kimi or whatever for help, as I am not sure I could get MinerU running, but I will try!
Thanks again comrade!
Oh, if you have a PDF you can still OCR it with MinerU; it will in fact be faster, because it can extract the text layer (I did The Wretched of the Earth in 4 minutes flat with it). The problem with PDFs is that if you copy the text outright, it will look weird because of how PDF handles text. MinerU is also content-aware, meaning it will remove the headers and footers if there are any, which is why I recommend it. It should also normally preserve tables (very important in some books) and styling such as italics and bold, which a simple copy-paste doesn’t. Basically, if you copy raw PDF text it looks like this:
[...] You can see that reflected in the products I have designed, which are often noted for their ease of use. The most powerful things are simple. Thus this book proposes a simple and straightforward theory of intelligence. I hope you enjoy it. 81 Artificial Intelligence When I graduated from Cornell in June 1979 with a degree in electrical engineering, I didn't have any major plans for my life. I started work as an engineer at the new Intel campus in Portland, Oregon. The microcomputer industry was just starting, and Intel was at the heart of it. My job was to analyze and fix problems found by other engineers working in the field with our main product, single board computers [...]

As you can see, the lines break weirdly, there’s a random page number and chapter reminder in the middle, and it’s missing some bolded text (the book is Jeff Hawkins’ On Intelligence).
LLMs can actually work very well with raw PDF text and clean it up for you, but if the text is really chopped up, it might need a cleaner copy to start with. Still, if you want to skip installing a bunch of stuff to make MinerU work, this could be worth attempting. Or, like I said, if your party is open to the idea, ask them to send you the raw docx files, which I’m sure they have (they probably import them into InDesign, and if they don’t, they should), and you can just upload those to deepseek and it will take care of the formatting for you.
Otherwise I’m putting the rest down here in a subsection:
Getting minerU to work
If you’re on Windows (which I assume, because you say you are not tech-savvy), you will need to install Python from https://www.python.org/downloads/release/python-31312/ (scroll to the bottom for the Windows installers). During installation, make sure you have admin permissions and check the “Add Python to PATH” checkbox or similar (it will say something about PATH).
Once Python is installed, you can install MinerU by opening cmd and typing `pip install mineru[all]` (or maybe `python -m pip install mineru[all]`). It will take some time, but it will install MinerU on your computer.

Once MinerU is installed, in the same cmd window, run `mineru-models-download`. Once again it will take a while as it downloads a bunch of models; expect it to take around 11 gigabytes of disk space in total.

Once everything is installed, you can run an OCR job with this command, again from the cmd window: `mineru -p /path/to/your/document.pdf -o /path/to/output/folder -l [language]`. You can do that at any time; you don’t need to reinstall everything we just did each time. The language is specified as [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|th|el|latin|arabic|east_slavic|cyrillic|devanagari] and refers to the alphabet/writing system, so I think “latin” should work for Hungarian. Otherwise it defaults to Chinese.
If at any point during the installation, or while trying to use MinerU, something isn’t clear or you get an error output in the cmd, just send the entire output to deepseek and it will tell you what to do. Use the expert mode with search on. I myself installed MinerU by copying the commands deepseek gave me, and didn’t even need to hunt anything down. I then ran into a bug when trying to run it, sent the output to deepseek, and it found the fix in 2 seconds (installing the Python development version). I can’t overstate how stress-free installing technical software has become.
But after that, you can quickly and easily OCR any PDF or image on your computer. Don’t forget to specify a dedicated folder for the output, as MinerU creates a bunch of files, including a markdown file and a JSON; the markdown is its OCR output. With pandoc, which is yet another piece of software to install, you can then transform that .md (markdown file) into another format without a hitch. To install pandoc on Windows, download it from https://pandoc.org/installing.html (click ‘get the latest installer’, then look for pandoc-3.9.0.2-windows-x86_64.msi) and then you can use the command
`pandoc text.md -o conversion.[extension]`

Pandoc is a really thorough program that can convert any text from one format to another, such as html to wikitext to markdown to epub to pdf to XML to whatever. You can find demos showcasing some conversions here: https://pandoc.org/demos.html. XML is what Word and LibreOffice use behind the .docx extension, so if you convert to .docx you can easily open the result in Word afterwards.
If I’m not mistaken, if you set the extension of your output file to .docx, for example, pandoc should automatically know that it has to convert to docx.
So basically, MinerU OCRs the PDF into usable text, in markdown format. Then with pandoc you can convert that markdown into a bunch of other formats, if that makes sense. Keep the markdown file MinerU makes (markdown is the same styling language we use on Lemmygrad, btw), and reconvert it with pandoc into anything you need.
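If you end up needing the same markdown in several formats, a tiny script can build the pandoc calls for you. This is just a sketch of that step: the function name and file names are made up, and it relies on the behavior described above where pandoc infers the output format from the extension.

```python
# Hypothetical helper: build one pandoc command per target format
# for a single markdown file produced by MinerU.
def pandoc_commands(md_file: str, targets: list[str]) -> list[list[str]]:
    base = md_file.rsplit(".", 1)[0]  # "book.md" -> "book"
    # pandoc infers the output format from the extension after -o
    return [["pandoc", md_file, "-o", f"{base}.{ext}"] for ext in targets]

# each command would then be run with subprocess.run(cmd, check=True)
```

For example, `pandoc_commands("book.md", ["docx", "epub"])` gives you the two commands to produce a Word file and an ebook from the same source.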
I did all of this myself the other day for the Wretched of the Earth and it worked really well! Just needs some manual cleaning up afterwards, but that’s usually just on the chapter titles and because of the PDF files themselves.
MinerU can also run on your CPU (loaded into RAM) if your GPU can’t handle it; you can find the different options by typing `mineru --help`, and it will tell you how to pick GPU or CPU. I think you need an Nvidia GPU for GPU mode; otherwise use the CPU and it should work well too (it just takes longer).
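And if you have a whole stack of scanned PDFs, you don’t have to type the OCR command once per file. Here is a small sketch that builds the MinerU command line for every PDF in a folder, mirroring the `-p`/`-o`/`-l` flags shown above; the function name and folder layout are my own invention for illustration.

```python
# Hypothetical batch helper: one mineru command per PDF in a folder.
from pathlib import Path

def mineru_commands(pdf_dir: str, out_dir: str, lang: str = "latin") -> list[list[str]]:
    cmds = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # one output subfolder per document keeps MinerU's many files tidy
        cmds.append(["mineru", "-p", str(pdf), "-o", f"{out_dir}/{pdf.stem}", "-l", lang])
    return cmds

# you would then run each command with subprocess.run(cmd, check=True)
```

Since each document gets its own output subfolder, the markdown and JSON files from different issues of the newspaper never overwrite each other.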
Fine-tuning your translation
I’ve been attempting translation work with LLMs for the past few years and I still haven’t found something I’m 100% happy with, though that 2-pass thing (first pass is the translation, second pass is a new conversation where you ask it to proofread and localize like an editor would) yields better results. This is kind of similar to the ‘critique’ they do in training, where you have the model being trained generating an output, and then another model ‘critiques’ it to find problems, and the model in training has to improve to fool the critique. I would try things around this concept, like sending a model both the original text and the LLM translation and asking it to compare, proofread, and fix.
LLMs are not great with all languages, because they don’t necessarily train on those languages. So it’ll really depend on the model; you should try a few with the same prompt and input text (just one page of a book is fine, preferably one that is representative of the difficulty of the task). Then, once you find a model that seems to handle Hungarian fine, refine the prompt you send along with the text. It might be a very long prompt. You might have to include a glossary of technical terms that need to be translated the same way each time, and you might need to specify a bunch of other things, like what language register to use, etc.
And basically you refine your prompt bit by bit like this until you get something that seems “good enough” for you. I find that it’s important to tell them to “write naturally without changing the content or the ideas - you are an editor, not an author” or something like that.
Once you’re happy with your prompt though you can save it somewhere on your computer to always have it around, and just reuse it each time.
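If you later want to script this two-pass flow instead of pasting into a chat window, the second pass is just a fresh conversation seeded with your editing prompt. This sketch assumes an OpenAI-compatible chat API such as DeepSeek’s; the prompt wording and function name are examples, and the actual network call is left commented out.

```python
# Sketch of the second (editing) pass. EDIT_PROMPT is example wording,
# not a magic formula - refine it like any other prompt.
EDIT_PROMPT = (
    "This text was translated by an LLM and reads artificial. Do a pass "
    "to make it sound natural in English. Do not change the content or "
    "the ideas - you are an editor, not an author."
)

def second_pass_messages(translation: str) -> list[dict]:
    # a brand-new conversation: only the editing prompt and the
    # first-pass translation, no leftover context from pass one
    return [
        {"role": "system", "content": EDIT_PROMPT},
        {"role": "user", "content": translation},
    ]

# with the openai package pointed at the DeepSeek endpoint (an assumption
# about your setup), the call would look roughly like:
# client.chat.completions.create(model="deepseek-chat",
#                                messages=second_pass_messages(first_pass))
```

Starting pass two in a clean conversation matters: the model edits what is on the page instead of defending its own earlier translation choices.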
As for language pairs, yes, it could probably work both ways. I.e. if you can get Hungarian translated to English in good quality (by an LLM), you can probably get the LLM to also translate English to Hungarian. Older methods and humans are more finicky lol, but in my opinion LLMs should have no problem with language pairs as long as they know one of the two languages sufficiently.
Agentic pipeline
The pipeline of Python scripts I was talking about is, you guessed it, more LLM stuff. Join us over on !crushagent@lemmygrad.ml to learn how to start using agentic tools on your computer. But basically: put $5 in the deepseek API, install crush, and then have the agent code you a bundle of scripts to automate most of the process. It’s what I did to get ProleWiki translated into French: a collection of 4 different Python scripts, all LLM-coded, to 1. download our pages, 2. translate them intelligently with an LLM (with progress tracking, cutting big files into chunks, etc.), 3. clean up the translation artefacts from the model, and 4. upload the translated pages to ProleWiki.
You don’t need to know computers or how to code anymore to have this kind of stuff and I think that’s pretty cool. It definitely helps but for something simple like that you don’t need to be too technical. The code might get complex, but you let the AI handle it. You’re the client for the script, you don’t need to know how it works, just that it does.
It’s more involved but then you could have a mostly automated pipeline that runs minerU on the pdfs, sends them to an LLM API to get them translated (like I have), then labels and saves the translations or something. That way instead of doing every step yourself you just run the script.
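The skeleton of such a pipeline is mostly bookkeeping. Here is a hedged sketch of the orchestration step with progress tracking so an interrupted run can resume; the `ocr` and `translate` arguments are stand-ins for the real MinerU and LLM-API calls, and all names here are hypothetical.

```python
# Hypothetical pipeline driver: OCR each PDF, translate it, save the
# result, and record progress so a crash or Ctrl+C doesn't lose work.
import json
from pathlib import Path

def run_pipeline(pdfs, ocr, translate, out_dir, progress_file="progress.json"):
    progress = Path(progress_file)
    done = set(json.loads(progress.read_text())) if progress.exists() else set()
    for pdf in pdfs:
        if pdf in done:
            continue  # resume support: skip already-translated documents
        text = ocr(pdf)  # would shell out to mineru and read its markdown
        Path(out_dir, Path(pdf).stem + ".md").write_text(translate(text))
        done.add(pdf)
        progress.write_text(json.dumps(sorted(done)))  # checkpoint after each file
    return sorted(done)
```

With this shape, re-running the script after a failure only processes the newspapers that weren’t finished yet, which matters when each LLM call takes minutes and costs API credits.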
But if you have PDFs you could probably just feed them manually to Deepseek tbh, by just uploading them in the chatbox.





