Exploring 3 Machine Learning Types

As a data scientist who has worked at LINDERA for close to six years, I regularly encounter questions and concerns about "AI", be it from family, friends or sometimes customers.
These exploded with the rise of ChatGPT, DALL·E 2 and similar programs that can generate content resembling human-made work. Even the question of sentience was thrown around when a Google researcher who had worked on a language model claimed it might have become sentient.
So to clear up some misconceptions about what "AI" (artificial intelligence) can and cannot do, I have assembled a brief overview of the different learning methods used today and why they matter. First of all, the naming itself can be cleared up, because what a data scientist works on is not artificial intelligence but machine learning. The difference is that machine learning is both broader and more specific at the same time: it is limited to, well, machines, so artificial biological brains are out, but on the flip side it doesn't need to produce something that could actually be classified as intelligent.
What is machine learning?
Generally, machine learning solves problems by looking for a function that takes a complex input and produces the desired output. Once found, these functions can be used over and over again using only a fraction of the computation needed to initially find them. ChatGPT and most other systems currently in use are at their core exactly such functions: complex, to be sure, but neither changing nor learning while they are in use.
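To make that concrete, here is a deliberately tiny sketch (nothing to do with any real product): the expensive part is searching for the function's parameters; once found, applying the function is cheap and the function itself never changes again.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + rng.normal(0, 0.05, size=200)  # hidden rule + noise

# Expensive part: search for the function (here just a least-squares fit).
A = np.hstack([X, np.ones((200, 1))])
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

# Cheap part: the found function, reused as often as we like, unchanged.
def predict(x):
    return w * x + b

print(predict(0.5))  # close to 3.0 * 0.5 + 0.5 = 2.0
```

Real systems search over millions of parameters instead of two, but the shape of the process is the same: find once, then apply.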
To find such a function one always begins by defining the input format and the output format. In the case of the system used at LINDERA, for example, we take a video as input and want the 3D positions of the target person's joints as output. Then comes a definition of the structure of the function to be used.
With current tools this is mostly done in the form of neural networks, though other forms do exist. And lastly one needs data, lots and lots of data. To be more specific, one needs pairs of inputs and target values.
These decisions about input, output and also where the data comes from let us classify different kinds of learning. The most common one is supervised learning, where the data has to be prepared by a different system: either humans creating the outputs, for example through CAPTCHAs, or a separate measuring system. In our case, to continue the example, the data for the 3D joints came from a 3D motion-capture system that uses a far more elaborate and far more expensive setup than a simple smartphone camera.
This method is called supervised learning because the training is essentially "supervised" by the system that generated the data: that system defined, through the data, what is right and wrong.
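A minimal sketch of that idea, with a made-up "expensive reference system" standing in for the motion-capture lab: the reference supplies the target for every input, and a cheap model is trained to reproduce those targets from the (input, target) pairs.

```python
import numpy as np

rng = np.random.default_rng(1)

def expensive_reference(x):
    # Hypothetical stand-in for the elaborate setup that defines "right".
    return (x.sum(axis=1) > 0).astype(float)

X = rng.normal(size=(500, 2))   # cheap-to-collect inputs
y = expensive_reference(X)      # supervision: targets from the reference system

# A tiny logistic-regression model trained by gradient descent on the pairs.
w = np.zeros(2)
b = 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # model's current guesses
    grad_w = X.T @ (p - y) / len(y)         # how wrong, per weight
    grad_b = (p - y).mean()
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

accuracy = (((X @ w + b) > 0).astype(float) == y).mean()
```

The trained model never sees the reference system itself, only the answers it produced; that is what "supervised" means here.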
On the other side is unsupervised learning, where either the desired output can be inferred directly from the input, or there is a dynamic system that can immediately give feedback on whatever output the network in training produced.
One might wonder what use the first of these options is when we can already infer the desired output through different means. But these neural networks are usually not trained for the output itself, rather for what happens to the input on the way to the output. Autoencoders, for example, are tasked with reproducing their input as their output while being limited in the amount of data that can pass through them, effectively learning a compression function somewhere inside.
This compression is then commonly used as the starting point for other tasks that use the same input.
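The bottleneck idea can be sketched with a linear stand-in: squeeze the data through a representation smaller than the input and then reconstruct it. For purely linear "encoders" and "decoders" the optimal bottleneck is given by PCA (computable via the SVD), so the effect can be shown without training a neural network; a real autoencoder does the same thing with nonlinear layers.

```python
import numpy as np

rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 2))    # the data really has only 2 factors...
X = latent @ rng.normal(size=(2, 8))  # ...embedded in 8 dimensions

# Top-2 principal directions act as the bottleneck.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def encode(x):
    return x @ Vt[:2].T               # 8 numbers squeezed down to 2

def decode(z):
    return z @ Vt[:2]                 # 2 numbers expanded back to 8

X_rec = decode(encode(X))
error = np.abs(X - X_rec).max()       # near-perfect reconstruction
```

The 2-number codes produced by `encode` are exactly the kind of compressed representation that is then reused as a starting point for other tasks.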
The second of these unsupervised learning approaches can again be split into reinforcement learning and generative networks. In reinforcement learning the target output is given through an interactive environment, most of the time a simulated one, though a real-world environment would be possible as well. Basically, the network gets information about the current state of the environment and then has to take an action inside it.
As soon as it does, the environment changes and gives the network feedback, which can then be used to improve the network so it takes a better action next time. It still holds that in most use cases the function or network is trained once and then used unchanged, but this method has the potential to learn continually. Past attempts show, however, that our real-world environment is still too complex, as demonstrated by the Twitter bot Tay, deployed by Microsoft in 2016, which became viciously racist after it was subjected to nefarious users.
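The state-action-feedback loop can be sketched with tabular Q-learning on a made-up toy environment (a five-cell corridor, not anything from a real product): the agent picks an action, the environment answers with a new state and a reward, and that feedback nudges the agent's value estimates.

```python
import random

random.seed(0)
N, GOAL = 5, 4                      # cells 0..4; reward waits at the right end
Q = [[0.0, 0.0] for _ in range(N)]  # Q[state][action], 0 = left, 1 = right

def act(s):
    # Explore sometimes (and on ties), otherwise take the best-looking action.
    if random.random() < 0.2 or Q[s][0] == Q[s][1]:
        return random.randrange(2)
    return 0 if Q[s][0] > Q[s][1] else 1

for _ in range(300):                # training episodes
    s = 2                           # start in the middle
    while s not in (0, GOAL):
        a = act(s)
        s2 = s + (1 if a == 1 else -1)
        r = 1.0 if s2 == GOAL else 0.0  # feedback from the environment
        # Update toward the reward plus the discounted value of the next state.
        Q[s][a] += 0.5 * (r + 0.9 * max(Q[s2]) - Q[s][a])
        s = s2

# After training, "go right" should look better than "go left" everywhere.
policy = [0 if Q[s][0] > Q[s][1] else 1 for s in range(1, GOAL)]
```

Note that learning happens only during this loop; once we freeze `Q`, the agent is again just a fixed function from states to actions.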
Lastly there are the generative networks, which are a bit of a mix of supervised and unsupervised learning, though the majority of the task sits in the unsupervised part. For generative learning a network is first trained, supervised, to classify target outputs.
For example, it might give the probability that any given picture was painted by Van Gogh. Then a second neural network is trained by giving it random numbers as input and using the first network's judgment of its output as the feedback.
So in our example the second network would produce images and get as feedback how much they look like they were made by Van Gogh. Over time the second network will produce images that look more and more like Van Gogh's, until we are satisfied with the result and finalize the network.
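Here is a hedged one-dimensional sketch of that second stage, assuming the first network is already trained and now acts as a frozen judge. The judge here is a made-up score that peaks at 1.0 (standing in for "probability this is a Van Gogh"), and the "generator" g(z) = a + b·z learns, from that feedback alone, to turn random numbers into outputs the judge likes.

```python
import numpy as np

rng = np.random.default_rng(3)
TARGET = 1.0

def judge(x):
    # Frozen scorer standing in for network one; peaks at TARGET.
    return np.exp(-(x - TARGET) ** 2)

a, b = 0.0, 1.0                 # generator parameters, deliberately far off
for _ in range(500):
    z = rng.normal(size=256)    # random numbers as the generator's input
    x = a + b * z               # generated "images" (just numbers here)
    # Gradient ascent on the judge's score, pushed back through the generator.
    g = -2.0 * (x - TARGET) * judge(x)
    a += 0.1 * g.mean()
    b += 0.1 * (g * z).mean()
```

After training, `a` sits near the judge's peak: the generator produces what the judge approves of, without ever seeing a "real" example itself. (In a full generative adversarial setup the judge keeps training at the same time, which is what makes those systems notoriously tricky.)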
Conclusion on machine learning types
ChatGPT and DALL·E 2 belong to this final category of generative learners (though their exact training techniques differ) and therefore will not improve anymore until they are retrained with more data. So no immediate, permanent learning occurs.
Now avid users might say: "Wait a second, but ChatGPT referenced what it wrote itself in this conversation. How does that work if it doesn't learn?" The answer is that ChatGPT uses the history of the current conversation as part of its input and hence knows how to continue the conversation. But as of yet it does not learn from those conversations. It does not change its function, only its inputs, and hence will only continue to do what it was trained for, not send terminators back in time.
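A deliberately dumb, hypothetical "chatbot" makes the distinction obvious: the function below never changes, yet it can still refer back to earlier turns, because the entire conversation history is handed to it as input on every call.

```python
def respond(history):
    # Fixed function: same history in, same reply out, forever.
    if any("name" in turn for turn in history):
        return "You told me your name earlier in this conversation."
    return "Tell me something about yourself."

history = []
history.append("my name is Ada")
reply = respond(history)   # "remembers" only via the input it was handed
history.append(reply)
```

Delete `history` and the "memory" is gone; nothing inside `respond` ever changed. ChatGPT's apparent memory within a conversation works on the same principle, just at vastly larger scale.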
For us at LINDERA this means that, since the function of our networks does not change just by us using them, we can assure the quality of the results our neural networks produce by testing them during our release process. And since only the inputs matter, those tests need to include videos taken "in the wild", meaning in the intended user environment. That is an extra step we are happy to take, since it ensures the safety of our users and helps them stay mobile.
Thanks for reading!
And seeing as this is my first blog post of this kind, I am of course open to questions and constructive criticism.