LMM Large Multimodal Model AI

rainphakir · April 26, 2024, 5:01pm

We’ve all heard of LLMs (Large Language Models) and ChatGPT 3 and 4.
And how they’re revolutionizing internet search and term papers.

Get ready for LMM - Large Multimodal Model.

{ A Large Multimodal Model (LMM) is an advanced type of artificial intelligence that can understand and generate content across multiple types of data, such as text, images, audio, and video. Think of it as a highly skilled artist who is not just a master painter but can also compose music, write captivating stories, and direct movies—all with a deep understanding of each medium’s nuances. }

Once this LMM technology matures … Lots of amazing conveniences will manifest?

Here’s Neil deGrasse Tyson’s POV.

ralph

steve203 · April 26, 2024, 5:16pm

…and that is why books were banned, in “Fahrenheit 451”, in “1984”…

Steve

captainccs · April 26, 2024, 5:38pm

Think of it as a highly skilled scientist. Nobel Prize winning theoretical physicist, Richard Feynman, said, “It starts with a guess…”

Then you apply the scientific method to the guess.

It is important to realize that AI and LLMs are highly creative but not necessarily accurate. Good ones have a high hit rate but are seldom if ever perfect. Self-driving will kill some people, hopefully fewer than human drivers kill.

The Captain

rainphakir · April 26, 2024, 6:02pm

I’m wondering if LMMs will democratize autonomous driving, to the extent that Tesla will NOT be the ONLY “FSD”?

Nvidia has hardware and Cuda. And AI scientists.

Tesla went to neural nets and decreased the “code” dramatically.

Will LMM allow competitors to “easily” challenge Tesla FSD?

ralph

captainccs · April 26, 2024, 7:22pm

LLMs are one application of Neural Network based AI. FSD is another. Same basic technology. Since it’s a technology that mimics how humans learn there are only two reasons why competitors cannot match Tesla’s FSD, money to fund the required computing power, and data to feed it.

At this time Tesla seems to be second only to Meta in computing power and orders of magnitude ahead of all car companies in driving data. With legacy automakers losing money on EVs the moat is quite powerful for the time being. A self driving car to be really useful has to be able to communicate verbally. Elon’s X has the LLM that could take care of that to complement Tesla’s FSD.

The Captain

rainphakir · April 26, 2024, 8:02pm

LMMs encompass both LLM and other data inputs?

FSD trained on videos MIGHT be termed LVM (Large Video Model)?

And an LMM (L Multimodal M) would encompass both LLM and LVM and other modalities?

I’m wondering if at some point an ASI (Artificial Super Intelligence, the fearsome take-over-the-world AI entity that comes after AGI) using an LMM might be able to “learn to drive” by watching a few thousand videos (real world and simulated), reading the “how to” text. And taking the test.

No need for “billions and billions of miles of driving videos”.

ralph

Pilots use flight simulators to learn to fly X aerial vehicle.

16yo teens learn to drive by watching mom/dad, taking a course, maybe watching a couple hours of video… And taking a test down at the DMV.

An LMM neural net seems capable of the same?

captainccs · April 26, 2024, 8:46pm

You are making this much too complicated.

The problem with heuristics is the inability to cover all edge cases. Video simulations are effectively ‘heuristics’ because they are man made. The ‘billions and billions of miles of driving’ are an attempt to capture as many unusual edge cases as possible, stuff that most people can’t even imagine.

About flight simulators, I love Mentour Pilot. In the last video I watched, he said that the situation in the cockpit was never practiced in simulator training. Same problem, the training is heuristics based, not real world. After real world happens they add it to the training program.

The Captain

bjurasz · April 26, 2024, 8:53pm

This is where randomness comes into play with these simulations, which can catch lots of edge cases and can create scenarios people can’t even imagine. More importantly you’ll veer off into scenarios and edge cases that people would never actually do in real-world driving. In other words, you’ll create training data that people would never produce on their own but that is important to hit anyway.

We do this all the time in the verification of CPUs and GPUs. It’s called directed random simulation. You can’t verify today’s designs without it.

captainccs · April 26, 2024, 9:12pm

As a developer a long time ago I was aware of the difficulty of testing and validating code with the variety of software and hardware the code could possibly encounter in real life. That was long before “directed random simulation.”

Is this what you are talking about?

Abstract

Directed Random Testing

Random testing can quickly generate many tests, is easy to implement, scales to large software applications, and reveals software errors. But it tends to generate many tests that are illegal or that exercise the same parts of the code as other tests, thus limiting its effectiveness. Directed random testing is a new approach to test generation that overcomes these limitations, by combining a bottom-up generation of tests with runtime guidance. A directed random test generator takes a collection of operations under test and generates new tests incrementally, by randomly selecting operations to apply and finding arguments from among previously-constructed tests. As soon as it generates a new test, the generator executes it, and the result determines whether the test is redundant, illegal, error-revealing, or useful for generating more tests. The technique outputs failing tests pointing to potential errors that should be corrected, and passing tests that can be used for regression testing. The thesis also contributes auxiliary techniques that post-process the generated tests, including a simplification technique that transforms a failing test into a smaller one that better isolates the cause of failure, and a branch-directed test generation technique that aims to increase the code coverage achieved by the set of generated tests.

Applied to 14 widely-used libraries (including the Java JDK and the core .NET framework libraries), directed random testing quickly reveals many serious, previously unknown errors in the libraries. And compared with other test generation tools (model checking, symbolic execution, and traditional random testing), it reveals more errors and achieves higher code coverage. In an industrial case study, a test team at Microsoft using the technique discovered in fifteen hours of human effort as many errors as they typically discover in a person-year of effort using other testing methods.

https://groups.csail.mit.edu/pag/pubs/randomtesting-pacheco-phdthesis-abstract.html
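To make the abstract's loop concrete, here is a minimal sketch of feedback-directed random test generation. It is a toy under stated assumptions, not the thesis's tool: `BoundedStack` (with a deliberately planted bug), `run`, and `generate` are hypothetical names made up for illustration. Each round extends a previously kept test with one random operation, executes it, and uses the result to classify the new test as illegal, error-revealing, redundant, or useful.

```python
import random


class BoundedStack:
    """Toy class under test. Planted bug: push ignores the capacity."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.items = []

    def push(self, x):
        self.items.append(x)  # bug: no capacity check

    def pop(self):
        if not self.items:
            raise IndexError("pop from empty stack")
        return self.items.pop()


def run(sequence):
    """Execute a list of (method_name, args) calls on a fresh stack."""
    stack = BoundedStack()
    for name, args in sequence:
        getattr(stack, name)(*args)
    return stack


def generate(rounds=200, seed=0):
    """Return (pool, failing): tests kept for extension and tests that reveal the bug."""
    rng = random.Random(seed)
    operations = [("push", lambda: (rng.randint(0, 9),)), ("pop", lambda: ())]
    pool = [[]]          # tests worth extending; starts with the empty test
    seen_states = {()}   # final states already reached, used to drop redundant tests
    failing = []
    for _ in range(rounds):
        base = rng.choice(pool)                    # reuse a previously built test
        name, make_args = rng.choice(operations)   # randomly pick the next operation
        candidate = base + [(name, make_args())]
        try:
            stack = run(candidate)                 # runtime feedback
        except IndexError:
            continue                               # illegal: popped an empty stack
        if len(stack.items) > stack.capacity:
            failing.append(candidate)              # error-revealing: invariant broken
            continue
        state = tuple(stack.items)
        if state in seen_states:
            continue                               # redundant: nothing new learned
        seen_states.add(state)
        pool.append(candidate)                     # useful: extend it in later rounds
    return pool, failing
```

Calling `pool, failing = generate()` and printing one entry of `failing` shows a call sequence that pushes past the capacity. A real tool would also shrink that sequence to a minimal failing test, which this sketch leaves out.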

The Captain

rainphakir · April 26, 2024, 10:05pm

Heuristics is “coding every case”, edge or common, and everything in between? Yes?

Hundreds of thousands if not millions of lines of code? Yes?

Is that how LLMs, LMMs, and AGI, (and eventually ASI), work?

If so… Then that’s NOT what I’m envisioning.

I thought they are now based on neural nets.
And they are perhaps a really advanced AI “agent” .

Back to learning to drive.
A 16yo teen learns to drive by watching mom/dad, other drivers, takes the drivers Ed course, watches some videos, reads the text, and takes a test.

She has NOT seen every “edge” case, yet is given permission to drive.

Her neural net provides the flexibility to adapt her actual experience to the “new, edge case” experiences she encounters on her journey.

Isn’t this a power law thing?
She has 80% of the needed knowledge, and will acquire more as time goes by and she gets experience with more “edgy” cases?
The “law” says she has the basic knowledge required to be allowed to drive. So… She drives.

An ASI (neural net) using LMMs (Multimodal, neural nets) in my imagination can do as well.

Llama (Metas LLM

ralph. Yeah I’m thinking a little wider and farther out than most comments I read/ hear.

captainccs · April 26, 2024, 10:51pm

Yes. I mentioned heuristics to explain why simulations are not likely to cover lots of edge cases. I was not aware of “directed random simulation.” That might help but I have my doubts that it can get them all. In any case, after a time there will be enough real world data to make simulations unnecessary.

The 16yo teen has been watching lots of people driving for well over a decade, mom, dad, bus drivers, and lots of movies with cars.

Her neural net has the accumulated learning of having survived since birth, a huge advantage over FSD AI.

I don’t think so.

During her 16 years she has probably ridden a bicycle and other contraptions. Crossed roads. Seen how street lights and signs work. Whether she drives or not she already has a vast store of useful learning. It’s not as if she were an empty driving bucket at 16. Long before 16 I could identify cars by their tail lights.

The Captain

Leap1 · April 27, 2024, 1:27am

Ralph, Silicon Valley would never repackage things for sales purposes.

This must be the biggest breakthrough since sliced bread.

rainphakir · April 29, 2024, 10:35pm

WRT running LLMs, (and LMMs), I keep hearing some $ amount per thousands or millions of “tokens”.

Tokens can refer to cryptocurrency units, but that didn’t seem to fit the LLM context.

In the LLM context here’s what I found:

{ When you are dealing with LLMs, you often come across the terms “vectors,” “tokens” and “embeddings.”

Snip ooooooooo

In mathematics and physics, a vector is an object that has both magnitude and direction.

Snip oooooooooo

In the realm of LLMs, vectors are used to represent text or data in a numerical form that the model can understand and process.

Snip oooooooo

Embeddings are high-dimensional vectors that capture the semantic meaning of words, sentences or even entire documents. }

[ralph - Embeddings are “vectors” which are data arrays in numeric form, and therefore can be processed by a neural net - LLM, LMM.]
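A small sketch of why "embeddings are vectors" is useful: once text is a list of numbers, similarity in meaning can be measured with ordinary math such as cosine similarity. The three-dimensional vectors below are invented for illustration (assumption: real embeddings come out of a trained model and have hundreds or thousands of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings"; the values are illustrative, not from any real model.
embeddings = {
    "car": [0.9, 0.1, 0.0],
    "truck": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

car_vs_truck = cosine_similarity(embeddings["car"], embeddings["truck"])
car_vs_banana = cosine_similarity(embeddings["car"], embeddings["banana"])
```

With these toy numbers, `car_vs_truck` comes out much larger than `car_vs_banana`, which is the sense in which embeddings "capture semantic meaning".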

Snip ooooooo

{ Tokens, which we explore in the next section, are the mechanism to represent text in vectors.

Tokens are the basic units of data processed by LLMs. In the context of text, a token can be a word, part of a word (subword), or even a character — depending on the tokenization process. }

https://thenewstack.io/the-building-blocks-of-llms-vectors-tokens-and-embeddings/#:~:text=Tokens%20are%20the%20basic%20units,depending%20on%20the%20tokenization%20process.
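As a rough illustration of the three granularities the quote mentions (word, subword, character), here is a toy tokenizer sketch. The function names and the tiny vocabulary are hypothetical. Real LLM tokenizers learn their subword vocabulary from data; the greedy longest-match rule below only stands in for that.

```python
def word_tokens(text):
    """One token per whitespace-separated word."""
    return text.split()


def char_tokens(text):
    """One token per character, spaces included."""
    return list(text)


def subword_tokens(word, vocab):
    """Greedy longest-match split of one word into pieces found in vocab."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single-character token.
            pieces.append(word[i])
            i += 1
    return pieces


toy_vocab = {"driv", "ing", "er"}  # made-up vocabulary for illustration
```

For example, `subword_tokens("driving", toy_vocab)` splits the word into `"driv"` and `"ing"`, while `char_tokens("driving")` yields seven tokens. The same text costs more tokens the finer the granularity.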

A token might be a single character in a text. The number of tokens would rapidly increase.

In the context of video … A token might be a frame?
For a frame/ photo/ picture it might be subunits, or objects within the frame?

WRT audio, what would it be?

Now the reference to “$10 per Million tokens” begins to make sense as the “cost” to run a LLM.
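The arithmetic behind a per-token price is simple enough to sketch. The $10 per million figure is the one from the post. The function name is hypothetical, and real providers set their own rates, often with different prices for input and output tokens.

```python
def cost_usd(num_tokens, price_per_million_usd):
    """Cost of processing num_tokens at a flat price per one million tokens."""
    return num_tokens / 1_000_000 * price_per_million_usd


# Example: a 2,500-token exchange at the $10 per million tokens quoted above.
example_cost = cost_usd(2_500, 10.0)
```

At that rate a 2,500-token exchange costs two and a half cents, and the full million tokens costs the quoted $10.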

I “think” I have enough understanding of what are “vectors, embeddings, and tokens” so that now I’ll better understand the YouTuber discussions.

ralph

Edit to add some more acronyms I’ve heard:

LVMs Large Vision Models.

VLA Vision Language Action Models.
This, IMO, is an LMM.

LBMs Large Behavior M.

VLA and LBM tie the AI to some “action” that results from the AI interpretation of “sensory inputs”.

rainphakir · April 30, 2024, 6:03pm

Here is my rudimentary understanding of Training vs Inferencing, as used in the discussions of LLMs, LMMs, FSD, AI Agents, ASI, etc.

Deep Learning Training vs. Inference: What’s the Difference?

{ The Machine Learning Life Cycle

Machine learning works in two main phases: training and inference.

In the training phase, a developer feeds their model a curated dataset so that it can “learn” everything it needs to about the type of data it will analyze.

Then, in the inference phase, the model can make predictions based on live data to produce actionable results. }

Training is the process by which a “blank slate” AI AGENT acquires the data (knowledge) it needs in order to perform some “intelligent” task. It’s not yet doing any real “work”. But it’s ready.
Training is compute intensive, and is likely done in a data center.

Inferencing is the process by which an AI AGENT detects the environment and COMPARES those sensory inputs to its “trained” data, and then OUTPUTS some real “work”. I.e., the Agent OWNER gets some real world benefit from the “trained agent”.

Inferencing occurs at the “edge” where the AI interacts with its environment. And uses “edge compute” capabilities.
IoT Agents.

When a user accesses an LLM (or LMM? or FSD?), she is getting an already trained model. She then inputs her “personal, novel” data/information, to the trained model. The LLM then compares the novel data to its stored training… And INFERS a “correct” response.
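A minimal sketch of the two phases, using the smallest possible "model" (a straight line fitted by gradient descent) rather than a neural net. All names and data here are made up for illustration. One detail it makes visible: training boils the data down into a few learned numbers (the parameters), and inference uses only those numbers on new input, without looking the training data up again.

```python
def train(data, epochs=2000, lr=0.01):
    """Training phase: repeatedly adjust w and b to shrink the squared error."""
    w, b = 0.0, 0.0  # the "blank slate"
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def infer(params, x):
    """Inference phase: apply the learned parameters to a new input."""
    w, b = params
    return w * x + b


# Curated training set that follows y = 2x + 1 (made-up data).
training_data = [(x, 2 * x + 1) for x in range(10)]
params = train(training_data)   # compute-heavy step, done once
prediction = infer(params, 20)  # cheap step, repeated for every new input
```

Here `prediction` lands close to 41 even though x = 20 never appeared in the training data. An LLM has billions of parameters instead of two, but the split between the expensive one-time fit and the cheap repeated use is the same.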

Meta’s Llama 3 LLM has a “small” 8 billion parameter model as well as a larger 70B parameter model.

{ This release features pretrained and instruction-fine-tuned language models with 8B and 70B parameters that can support a broad range of use cases.

Snip ooooooooo

Llama 3 is pretrained on over 15T tokens that were all collected from publicly available sources. }

More at the link.

I’ve “heard” YouTubers say that the smaller 8B parameter Llama 3 LLM is designed for DIY and small business users who don’t need (or want) to use the larger LLMs.

The larger 70B parameter model is designed for users with larger budgets, technical support, and use needs.

Those YouTubers also suggest that “use and fine tuning” of the LLM at the user level, with the “edge” data, becomes part of the training set, and improves the LLM’s output.

I sorta intuited these concepts. The “data” I’ve acquired allows me to infer a better understanding of the YouTube discussions of the tech.

ralph

Edit: I’m using @Leap1. I’m unable to reply to my own posts, so I’m replying to Leap.

captainccs · April 30, 2024, 7:11pm

More simply:

At the datacenter the neural network is learning
The inference chip is executing

The Captain is writing

bjurasz · April 30, 2024, 7:54pm

It can occur at the edge. Sometimes is still done in the data center. When I interact with ChatGPT that is all done in their data center. The inferring of the answer to my query does not happen on my laptop.

Where inferring is done depends entirely on the size of the trained model.

rainphakir · April 30, 2024, 10:10pm

Thanks for the comments.

IIRC, the YouTubers have said Meta Llama 3 8B parameter model can run “locally - not connected to the Internet”, as long as the chipset/computer can handle it.
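Whether "the chipset/computer can handle it" is largely a memory question, and a back-of-the-envelope estimate is parameters times bytes per parameter. The sketch below makes some assumptions: it counts the weights only, ignores the extra working memory needed while running, and uses 1 GB = 10^9 bytes. The function name is hypothetical.

```python
def weights_gb(num_params, bytes_per_param):
    """Approximate memory for the model weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


small_16bit = weights_gb(8e9, 2)    # 8B parameters at 16 bits (2 bytes) each
large_16bit = weights_gb(70e9, 2)   # 70B parameters at 16 bits each
small_4bit = weights_gb(8e9, 0.5)   # 8B parameters squeezed to 4 bits each
```

That works out to roughly 16 GB for the 8B model and 140 GB for the 70B model at 16 bits per parameter, and about 4 GB for the 8B model at 4 bits. The small model fits on a well-equipped PC, while the large one usually lives in a data center.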

Tesla FSD runs on the car system with Hardware 3 or 4.

This video is an example of Inference?

And uses the SUNO AI V3 “trained” on songs / music from many genres?

Where is this Inferencing occurring?

Wes Roth mentions being prompted to pay some credits.

TIA for helping me better understand this tech!

ralph

Leap1 · April 30, 2024, 10:15pm

This is a sales scheme. The word ‘results’ is key. The improvement is not happening.

Closed and open systems, i.e. a roll of the dice versus the complexities of human speech or driving, are far too different. The risk tools we use as human beings are totally different. Closed system games of chance are classical odds. Nothing in the open world of events and happenings (redundant perhaps) is anything like that. The machine cannot compensate.

The gap between the generic event and the specific happening is unbridgeable for the machine. Classes in computer code for objects do not bridge it.

rainphakir · January 22, 2025, 12:51am

Continuing the “vocabulary”.

First, back in DOS days, we had to boot with a boot disc, then insert 5 1/4 floppy and type “run” or “exec” commands in order to do some “work”.
On a greenish screen. We had to memorize some basic coding and use file trees to find our “files”.
Then the GUI was discovered, and we could click a link.

Compute began to move to the realm of “not a nerd” users.

Eventually, “mobile” happened, the “desktop methods” were deemed obsolete, and the infernal “app” links were forced on us.
But, still these were just incremental advances that let us “do some work” with fewer “clicks”. Programs were being “strung together” so that the next one, would automatically “run”.

Nothing intelligent about the program/s being “run”. Once started, it performed each step in a more or less linear fashion.
Non-nerds benefitted.

The App concept was more like a trusted employee who “added value” to the work being done before you saw it.

There was a massive proliferation of apps 10-15 years ago. A gold rush, if you will, and nerds made bank.
And, today, apps are commoditized.

You/we interact with your/our devices via apps.

Until now.
Today, we are increasingly interacting via AI Agents, or the Agentic Framework.

These “agents” are software bots. Somewhat autonomous lines of code that “run/execute” a complex series of complex “commands”.

Agentic platform/s are the next step in this evolution? IDK.

{The Rise of Intelligent Agents: Exploring the Levels of Agentic Platforms
LLM-powered Agentic Platforms could mark a significant advancement in the field of artificial intelligence, harnessing the power of Large Language Models (LLMs) to create highly sophisticated and interactive agents. These platforms enable the development of intelligent agents capable of understanding, reasoning, and engaging in natural conversations, providing accurate and context-aware assistance across a wide range of domains.

Agentic Platforms can be described through multiple levels, each building upon the capabilities of the previous one and introducing new features and functionalities.}

We holler at the platform (Gemini, Claude, Grok etc), tell it in a general way what we want, and the Agentic Platform activates (runs/executes) a bunch of bots that use some level of AI to deliver a final product.
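A toy sketch of that "tell it what we want, it runs a bunch of bots" loop. Everything here is hypothetical: the tool names, the `plan` function, and the keyword matching are made up for illustration. In a real agentic platform an LLM decides which tools to call and with what arguments; a simple keyword rule stands in for it here.

```python
def add_numbers(text):
    """Tool: sum every whole number mentioned in the request."""
    numbers = [int(tok) for tok in text.split() if tok.isdigit()]
    return str(sum(numbers))


def shout(text):
    """Tool: return the request in upper case."""
    return text.upper()


# Registry of the "bots" the platform can activate.
TOOLS = {"add": add_numbers, "shout": shout}


def plan(request):
    """Stand-in for the LLM: pick every tool whose name appears in the request."""
    return [name for name in TOOLS if name in request.lower()]


def run_agent(request):
    """Plan, run each chosen tool in order, and collect the results."""
    results = {}
    for name in plan(request):
        results[name] = TOOLS[name](request)
    return results


answer = run_agent("please add 2 and 40")
```

The shape is the point: a general request goes in, a planner chooses steps, tools do the work, and the results come back as one final product. Swapping the keyword rule for a model call is what turns this from a script into an "agent".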

Am I understanding what is an Agent and what is Agentic Framework?

ralph

flyerboys · January 22, 2025, 1:10am

XLNT summary.

But us truly ancient nerds would load a (carefully protected because absurdly fragile) paper tape into the primitive because so simple paper tape reader, and then “key” or “paddle” (meaning turning each of 8 paddle switches on the front frame of the computer) in the octal sequence of numbers making up the “machine code” needed to run the paper tape reader (which had one command: read tape, into RAM), and then load from the paper tape the “boot program”, a primitive Operating System that then could read the (huge floppy) disk that held the actual operating system.

I would get cross if someone desperately needed me to “wake up” the computer to do something or other, but would spill coffee on the paper tape while endlessly jabbering at me as I tried to remember the octal code sequence to paddle in….

I don’t even want to talk about binary card readers……or mag tape based computers.

d fb
