The Gemma License is a home-grown “open-source” license, meaning Google created its own, entirely new license specifically for Gemma rather than choosing a pre-existing open-source license such as the MIT License. This is not an unusual move when a company creates a new open-source project; Meta took a similar approach with its home-grown Llama 3.1 license.
Is the Gemma License an Open-Source License?
By some measures, yes, but it’s a little complicated, and for most intents and purposes it doesn’t actually matter. Before we dive into the details, however: many readers are Googling this question because they want to build a commercial application on top of Gemma. TL;DR: this is OK for most use cases.
What does “open-source” even mean in the Gen AI context?
It may come as a surprise to some that this question is currently the subject of much debate. The Open Source Initiative (OSI), a non-profit standards organization that’s been a thought leader in the open-source space for decades (read: “OG open source people”), has been working on a definition of “Open Source AI.” Here you can read their latest attempt at that definition.
A generative AI model (remember, we are not talking about discriminative models) has three components:
- source code of the model itself
- model weights and parameters
- data used to originally train the model (aka the training data)
Basically, the hubbub is around the third component, the training data. As discussed in the Gen AI open-source licensing article, there are a few schools of thought with respect to the training data:
- You don’t need it in order for a Gen AI model to be called “open source”
- You need a detailed description of it in order for a Gen AI model to be called “open source”
- You need to make the training data available along with all the other stuff in order for a Gen AI model to be called “open source”
OSI believes you need a detailed description of the training data to accompany the other open-source components in order for a Gen AI model to be called “open source”: “Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition.”
I’m a little more hardcore than the OSI folks, I suppose. My opinion is that the training data itself must be made available for download in order for a Gen AI model to truly be called “open source.” My reasoning is fairly simple: “open-source software,” as it was originally defined (or at least, understood), described a process whereby a compiled program (binaries) could be recreated in its entirety by anyone because all of the requisite components were available for download.
It’s worth noting that my (admittedly dogmatic) opinion disqualifies huge numbers of self-described “open-source” Gen AI models from being accurately described as such. There are plenty of models, however, that do disclose and provide links to their training data. For example, some models are trained solely on the Common Crawl and duly link to it in their distribution notes.
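To make the three schools of thought concrete, here is a minimal Python sketch. All of the names in it (`TrainingData`, `ModelRelease`, `is_open_source`) are hypothetical and exist only for illustration. It also assumes, as a simplification, that the source code and weights are each either open or not. It is a way of comparing the positions, not a legal test.

```python
from dataclasses import dataclass
from enum import IntEnum


class TrainingData(IntEnum):
    """How much of the training data a release discloses (illustrative scale)."""
    UNDISCLOSED = 0  # nothing said about the training data
    DESCRIBED = 1    # a detailed description is published
    AVAILABLE = 2    # the data itself can be downloaded


@dataclass(frozen=True)
class ModelRelease:
    """Hypothetical summary of what a Gen AI model release makes open."""
    source_code_open: bool
    weights_open: bool
    training_data: TrainingData


def is_open_source(release: ModelRelease, required: TrainingData) -> bool:
    """Apply one school of thought, expressed as the minimum data disclosure."""
    return (
        release.source_code_open
        and release.weights_open
        and release.training_data >= required
    )


# The three positions from the list above, as minimum requirements.
SCHOOLS = {
    "training data not needed": TrainingData.UNDISCLOSED,
    "detailed description needed (the OSI position)": TrainingData.DESCRIBED,
    "data itself needed (my position)": TrainingData.AVAILABLE,
}

# A made-up release: open code and weights, training data described only.
example = ModelRelease(True, True, TrainingData.DESCRIBED)
verdicts = {name: is_open_source(example, req) for name, req in SCHOOLS.items()}

for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

The same release passes under the first two positions and fails under the third, which is the whole disagreement in miniature.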
What’s weird about the Gemma 2 License?
It’s important to understand that the most common open-source software licenses cover the same basic concepts and do so in a similar way. If someone, or some company, is ginning up their own “open source” license in 2024, it’s probably because they have a specific need or concern and the pre-existing licenses (of which there are thousands) don’t fit the bill, if you will. Here’s the part of the Gemma License that is “non-standard”:
“You must not use any of the Gemma Services:
- for the restricted uses set forth in the Gemma Prohibited Use Policy at ai.google.dev/gemma/prohibited_use_policy (“Prohibited Use Policy”), which is hereby incorporated by reference into this Agreement; or
- in violation of applicable laws and regulations.”
The Gemma Prohibited Use Policy is, in my opinion, industry-standard as AI use policies go (and I’ve read a few…) and contains nothing unusual or objectionable. In fact, to the extent its contours fall outside “don’t violate applicable law,” those portions of the Policy are in furtherance of values that 99% of humans, including me, hold dear (e.g., “don’t bully people”).
Anyway, for readers interested in building apps on top of a self-hosted instance of Gemma 2: unless your use case involves breaking the law or other nefarious end uses, you should be fine. Build away.
But the comment section on Reddit / Hacker News said I can’t use Gemma for commercial applications?
They did, did they? I think some folks may be conflating two separate documents. Google’s Terms of Service (which, for example, restrict “using AI-generated content from our services to develop machine learning models or related AI technology”) apply to services hosted by Google itself. They are separate and distinct from the Gemma 2 License, which applies to self-hosted uses of Gemma 2 (i.e., when a software developer downloads Gemma 2 and uses it locally or on their own server to power a component/feature of their app).