It is only open source if I can build it myself, which I can’t if you just give me the weights.
The weights are the “compiled” version of the dataset. It’s the dataset that’s the source, not the weights
@vrighter @ylai
That is a really bad analogy. If the “compilation” takes 6 months on a farm of 1000 GPUs and the results are random, then the dataset is basically worthless compared to the model. Datasets are easily available, always were, but if someone invests the effort in the training, then they don’t want to let others use the model as open-source. Which is why we want open-source models. But not “openwashed” ones, where they call it “open” for non-commercial, no modifications, no redistribution.
“the results are random therefore the dataset is useless”
tell that to any FPGA toolchain
The situation is somewhat different and nuanced. With weights there are tools for fine-tuning, LoRA/LoHa, PEFT, etc., which presents a different situation than with binaries for programs. You can see that despite e.g. LLaMA being “compiled”, others can significantly use it to make models that surpass the previous iteration (see e.g. recently WizardLM 2 in relation to LLaMA 2). Weights are also to a much larger degree architecturally independent than binaries (you can usually cross train/inference on GPU, Google TPU, Cerebras WSE, etc. with the same weights).
How is that different than e.g. patching a closed-source binary? There are plenty of community patches to old games to e.g. make them work on newer hardware. Architectural independence seems irrelevant, it’s no different than e.g. Java bytecode.
This is a very shallow analogy. Fine-tuning is rather the standard technical approach to reduce compute, even if you have access to the code and all training data. Hence there has always been a rich and established ecosystem for fine-tuning, regardless of “source.” Patching closed-source binaries is not the standard approach, since compilation is far less computationally intensive than today’s large-scale training.
Java bytecode is a far-fetched example. The JVM does assume a specific architecture that is particular to the CPU-dominant world when it was developed, and Java bytecode cannot be trivially executed (efficiently) on a GPU or FPGA, for instance.
And by the way, the issue of weight portability is far more relevant than the forced comparison to (simple) code can accomplish. Usually today’s large scale training code is very unique to a particular cluster (or TPU, WSE), as opposed to the resulting weight. Even if you got hold of somebody’s training code, you often have to reinvent the wheel to scale it to your own particular compute hardware, interconnect, I/O pipeline, etc… This is not commodity open source on your home PC or workstation.
The analogy works perfectly well. It does not matter how common it is. Patching binaries is very hard compared to e.g. LoRA. But it is still essentially the same thing: making a derivative work by modifying parts of the original.
How does this analogy work at all? LoRA is chosen by the modifier to be low-rank to accommodate some desktop/workstation memory constraint, not because the other weights are “very hard” to modify if you happen to have the necessary compute and I/O. The development of LoRA is also largely directed by storage reduction (hence not too many layers modified) and preservation of generalizability (since training generalizable models is hard). The Kronecker product versions, in particular, were first developed in the context of federated learning, and not for desktop/workstation fine-tuning (also, LoRA is fully capable of modifying all weights; it is rather a technique to do it in a correlated fashion to reduce the size of the gradient update). And much development of LoRA happened in the context of otherwise fully open datasets (e.g. LAION) that are just not manageable in desktop/workstation settings.
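To make the “modifies all weights, but in a correlated fashion” point concrete, here is a rough sketch, assuming the usual low-rank formulation W' = W + B·A. The tiny matrices and the helper names (`matmul`, `lora_update`) are made up for illustration and are not taken from any particular library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_update(w, b, a, scale=1.0):
    """Return W + scale * (B @ A) without modifying W in place."""
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical 3x4 "pretrained" weight matrix.
W = [[1.0, 0.0, -1.0, 2.0],
     [0.5, 1.5, 0.0, -2.0],
     [3.0, -1.0, 1.0, 0.0]]

# Rank-1 factors: B is 3x1, A is 1x4.
B = [[1.0], [2.0], [3.0]]
A = [[1.0, -1.0, 2.0, 0.5]]

W_new = lora_update(W, B, A)

d_out, d_in, rank = len(W), len(W[0]), len(A)
full_params = d_out * d_in           # entries in a full-size update
lora_params = rank * (d_out + d_in)  # entries actually stored and trained

# Count how many entries of W were changed by the low-rank update.
changed = sum(w != n for w_row, n_row in zip(W, W_new)
              for w, n in zip(w_row, n_row))

print(f"stored update parameters: {lora_params} of {full_params}")
print(f"weights changed: {changed} of {full_params}")
```

In this toy case every entry of W changes even though only 7 numbers are stored for the update, and the gap between those two counts grows quickly with realistic layer sizes.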
This narrow perspective of “source” is taking away the actual usefulness of compute/training here. Datasets from e.g. LAION to Common Crawl have been available for some time, along with training code (sometimes independently reproduced) for the Imagen diffusion model or GPT. It is only when e.g. GPT-J came along that somebody invested into the compute (including how to scale it to their specific cluster) that the result became useful.
So the cover art I made for a friend’s album isn’t open source, even though I released it as CC BY-SA… because you can’t make it yourself?
I would consider the “source code” for artwork to be the project file, with all of the layers intact and whatnot. The Photoshop PSD, the GIMP XCF or the Krita KRA. The “compiled” version would be the exported PNG/JPG.
You can license a compiled binary under CC BY if you want. That would allow users to freely decompile/disassemble it or to bundle the binary for their purposes, but it’s different from releasing source code. It’s closed source, but under a free license.
I think technically, the source should be the native format of whatever image manipulation program you use. For vector graphics there is the SVG format, but the native editor format is still preferable. Otherwise, whoever gets the end copy cannot easily modify or reproduce it, only copy it. But of course it depends on the definition of “easy” and a lot of other factors. Licensing is hard, and I am not a lawyer.
It would depend on the format what is counted as source, and what isn’t.
You can create a picture by hand, using no input data.
I challenge you to do the same for model weights. If you truly just sit down and type away numbers in a file, then yes, the model would have no further source. But that is not something that can be done in practice.
I challenge you to recreate the Mona Lisa.
My point is that these models are so complex that they’re closer to art than anything reproducible.
I don’t see your point? What is the “source” for Mona Lisa I would use? For LLMs I could reproduce them given the original inputs.
Creating those inputs may be an art, but so could any piece of code. No one claims that code being elegant disqualifies it from being open source.
Are you sure that you can reproduce the model, given the same inputs? Reproducibility is a difficult property to achieve. I wouldn’t think LLMs are reproducible.
In theory, if you have the inputs, you have reproducible outputs, modulo perhaps some small deviations due to non-deterministic parallelism. But if those effects are large enough to make your model perform differently you already have big issues, no different than if a piece of software performs differently each time it is compiled.
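For anyone wondering where the “non-deterministic parallelism” deviations come from: floating-point addition is not associative, so the order in which partial results get summed changes the answer. A minimal sketch, where `reduce_in_order` is a hypothetical stand-in for whatever order a parallel reduction happens to finish in:

```python
def reduce_in_order(values, order):
    """Sum values in the given index order, like one possible reduction schedule."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Same three numbers, two groupings.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)
print(left_first == right_first)  # False

# Same three "gradient contributions", two arrival orders.
grads = [1e16, 1.0, -1e16]
print(reduce_in_order(grads, [0, 1, 2]))  # 0.0 (the 1.0 is absorbed by 1e16)
print(reduce_in_order(grads, [0, 2, 1]))  # 1.0
```

The second case is deliberately extreme. In real training the per-step differences are usually tiny, and whether they stay tiny over a long run is exactly what is being argued about here.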
That’s the theory for some paradigms that were specifically designed to have the property of determinism.
Most things in the world, even computers, are non-deterministic
Nondeterminism isn’t necessarily a bad thing for systems like AI.
you released it under a non open source license. So very clearly: no it is not
Wut. That license is literally compatible with the GPL
CC BY-SA is considered open source. CC BY-NC is not.
This needs to have multiple levels of “openness” to distinguish between having access to the code, the dataset, a documented training procedure, and the final weights. I wouldn’t consider it fully open unless these are all available, but I still appreciate getting something over nothing, and I think that should be encouraged.
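One way to picture those levels is as a simple checklist per release. This is only a sketch: the `Release` class and its field names are hypothetical and do not correspond to any existing standard or definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Which of the four artifacts a model release actually publishes."""
    code: bool
    dataset: bool
    training_procedure: bool
    weights: bool

    def missing(self):
        """Names of the artifacts that were not published, in field order."""
        return [name for name, published in vars(self).items() if not published]

    def fully_open(self):
        """True only when nothing is missing."""
        return not self.missing()


weights_only = Release(code=False, dataset=False,
                       training_procedure=False, weights=True)
everything = Release(code=True, dataset=True,
                     training_procedure=True, weights=True)

print(weights_only.missing())     # ['code', 'dataset', 'training_procedure']
print(weights_only.fully_open())  # False
print(everything.fully_open())    # True
```

A weights-only release still counts as “something over nothing” here; it just reports what is missing instead of collapsing everything into open versus closed.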
Years ago I found myself explaining to Chinese Room dinguses - in a neural network, the part that does stuff is not the part written by humans.
I’m not sure it’s meaningful to say this sort of AI has source. You can have open data sets. (Or rather you can be open about your data sets. I don’t give a shit if LLMs list a bunch of commercial book ISBNs.) But rebuilding a network isn’t exactly a matter of hitting “compile” and going out for coffee. It can take months, and the power output of a small city… and it still can’t be exact. There’s so much randomness involved in the process that it’d be iffy whether you get the same weights twice, even if you built everything around that goal.
Saying “here’s the binary, do whatever” is honestly a lot better for neural networks than for code, because it’s not like the people who made it know how it works either.
My issue will be when OSI deems something as nonfree simply for adding that NC for non-commercial labels so the corporations can’t abuse the Commons.
I feel like it’s okay that they do this, but I don’t like the term “source available”. Maybe something like “Free for Non-Commercial Use” or “FOSS-NC”?
The free software banshees will call it all proprietary… It’s not that it doesn’t make sense to draw different lines, but when folks treat OSI with a lot of reverence & it says something doesn’t match its definition, folks won’t want to use it or release under these titles. “Source available” is also roped in with the we-get-a-monopoly licenses & gets knocked down a peg as if “open source” is the pinnacle of freedom despite the Commons being ransacked by corporations not giving back monetary support or contributions for the labor.
“source available” licenses are making the commons MORE ransacked by corporations. Which direction do you want to go?
This isn’t binary. If you shriek that all things that aren’t open source are the same, then you will miss all the nuance. There is a difference between what Redis just did & what copyfair or copyfarleft or Creative Commons Non-Commercial are suggesting.
@toastal I don’t need to compare each license to each other and get lost in wicked little words, arguing with anonymous accounts on the internet. I can instead see which change was a move towards, or away from, a world ransacked by corporations. That is clearly binary. Would you argue that Redis made the world less ransacked by their license change?
Redis isn’t doing what I would like to see more of in the world. Kicking out the profit & capital is not the same as trying to maintain your monopoly like Redis. Open source has often failed us… & instead we see compromises like AGPL which is restricting the “4 Freedoms” due to corporate exploitation. It’s a form of weak copyfarleft as far as I am concerned & everyone knows its license is a bit weird, but not looking at the root cause which isn’t network usage, but general exploitation from the capitalists.