LMM Large Multimodal Model AI

We’ve all heard of LLMs (Large Language Models) and ChatGPT 3 and 4,
and how they’re revolutionizing internet search and term papers.

Get ready for LMM - Large Multimodal Model.

{ A Large Multimodal Model (LMM) is an advanced type of artificial intelligence that can understand and generate content across multiple types of data, such as text, images, audio, and video. Think of it as a highly skilled artist who is not just a master painter but can also compose music, write captivating stories, and direct movies—all with a deep understanding of each medium’s nuances. }

Once this LMM technology matures … Lots of amazing conveniences will manifest?

Here’s Neil deGrasse Tyson’s POV.



…and that is why books were banned, in “Fahrenheit 451”, in “1984”…



Think of it as a highly skilled scientist. Nobel Prize winning theoretical physicist, Richard Feynman, said, “It starts with a guess…”

Then you apply the scientific method to the guess.

It is important to realize that AI and LLMs are highly creative but not necessarily accurate. Good ones have a high hit rate but are seldom, if ever, perfect. Self-driving will kill some people, hopefully fewer than human drivers kill.

The Captain


I’m wondering if LMMs will democratize autonomous driving, to the extent that Tesla will NOT be the ONLY “FSD”?

Nvidia has hardware and CUDA. And AI scientists.

Tesla went to neural nets and decreased the “code” dramatically.

Will LMM allow competitors to “easily” challenge Tesla FSD?


LLMs are one application of neural-network-based AI. FSD is another. Same basic technology. Since it’s a technology that mimics how humans learn, there are only two reasons why competitors cannot match Tesla’s FSD: money to fund the required computing power, and data to feed it.

At this time Tesla seems to be second only to Meta in computing power and orders of magnitude ahead of all car companies in driving data. With legacy automakers losing money on EVs the moat is quite powerful for the time being. A self driving car to be really useful has to be able to communicate verbally. Elon’s X has the LLM that could take care of that to complement Tesla’s FSD.

The Captain


LMMs encompass both LLM and other data inputs?

FSD trained on videos MIGHT be termed LVM (Large Video Model)?

And an LMM (L Multimodal M) would encompass both LLM and LVM and other modalities?

I’m wondering if at some point an ASI (Artificial Super Intelligence, the fearsome take-over-the-world AI entity that comes after AGI) using an LMM might be able to “learn to drive” by watching a few thousand videos (real world and simulated), reading the “how to” text. And taking the test.

No need for “billions and billions of miles of driving videos”.


Pilots use flight simulators to learn to fly X aerial vehicle.

16yo teens learn to drive by watching mom/dad, taking a course, maybe watching a couple hours of video… And taking a test down at the DMV.

An LMM neural net seems capable of the same?


You are making this much too complicated.

The problem with heuristics is the inability to cover all edge cases. Video simulations are effectively ‘heuristics’ because they are man made. The ‘billions and billions of miles of driving’ are an attempt to capture as many unusual edge cases as possible, stuff that most people can’t even imagine.

About flight simulators, I love Mentour Pilot. In the last video I watched, he said that the situation in the cockpit was never practiced in simulator training. Same problem, the training is heuristics based, not real world. After real world happens they add it to the training program.

The Captain

This is where randomness comes into play with these simulations, which can catch lots of edge cases and can create scenarios people can’t even imagine. More importantly, you’ll veer off into scenarios and edge cases that people would never actually get into in real-world driving. In other words, you’ll create training data that people would never produce on their own but that is important to hit anyway.

We do this all the time in the verification of CPUs and GPUs. It’s called directed random simulation. You can’t verify today’s designs without it.


As a developer a long time ago I was aware of the difficulty of testing and validating code with the variety of software and hardware the code could possibly encounter in real life. That was long before “directed random simulation.”

Is this what you are talking about?


Directed Random Testing

Random testing can quickly generate many tests, is easy to implement, scales to large software applications, and reveals software errors. But it tends to generate many tests that are illegal or that exercise the same parts of the code as other tests, thus limiting its effectiveness. Directed random testing is a new approach to test generation that overcomes these limitations, by combining a bottom-up generation of tests with runtime guidance. A directed random test generator takes a collection of operations under test and generates new tests incrementally, by randomly selecting operations to apply and finding arguments from among previously-constructed tests. As soon as it generates a new test, the generator executes it, and the result determines whether the test is redundant, illegal, error-revealing, or useful for generating more tests. The technique outputs failing tests pointing to potential errors that should be corrected, and passing tests that can be used for regression testing. The thesis also contributes auxiliary techniques that post-process the generated tests, including a simplification technique that transforms a failing test into a smaller one that better isolates the cause of failure, and a branch-directed test generation technique that aims to increase the code coverage achieved by the set of generated tests.

Applied to 14 widely-used libraries (including the Java JDK and the core .NET framework libraries), directed random testing quickly reveals many serious, previously unknown errors in the libraries. And compared with other test generation tools (model checking, symbolic execution, and traditional random testing), it reveals more errors and achieves higher code coverage. In an industrial case study, a test team at Microsoft using the technique discovered in fifteen hours of human effort as many errors as they typically discover in a person-year of effort using other testing methods.
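
To make the abstract above a bit more concrete, here is a toy sketch of the loop it describes: randomly pick an operation, append it to a previously built sequence, execute the result, and classify it as illegal, failing, redundant, or useful. This is not the thesis’s tool. The `BoundedStack` class, its planted bug, and the "same final state means redundant" rule are all assumptions made up for illustration.

```python
import random


class BoundedStack:
    """Hypothetical class under test, with a planted bug for illustration."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.items = []

    def push(self, x):
        if len(self.items) >= self.capacity:
            raise ValueError("stack full")   # documented rejection
        self.items.append(x)

    def pop(self):
        if not self.items:
            raise ValueError("stack empty")  # documented rejection
        return self.items.pop()

    def peek(self):
        # Planted bug: raises IndexError on an empty stack
        # instead of the documented ValueError.
        return self.items[-1]


OPERATIONS = ["push", "pop", "peek"]


def run(sequence):
    """Execute a sequence of operations and classify the outcome."""
    stack = BoundedStack()
    try:
        for op in sequence:
            if op == "push":
                stack.push(0)
            else:
                getattr(stack, op)()
    except ValueError:
        return "illegal", None   # precondition violated: discard the test
    except Exception:
        return "failing", None   # undocumented crash: report the test
    return "ok", tuple(stack.items)


def directed_random_tests(rounds=500, seed=0):
    rng = random.Random(seed)
    pool = [[]]           # useful sequences that later tests can extend
    seen_states = {()}    # final states already reached (redundancy check)
    failing = []
    for _ in range(rounds):
        # Extend a previously constructed test with one random operation.
        candidate = rng.choice(pool) + [rng.choice(OPERATIONS)]
        verdict, state = run(candidate)
        if verdict == "failing":
            failing.append(candidate)
        elif verdict == "ok" and state not in seen_states:
            seen_states.add(state)
            pool.append(candidate)
    return failing, pool


failing, pool = directed_random_tests()
print("shortest failing test:", min(failing, key=len))
print("useful sequences kept:", len(pool))
```

The runtime feedback is what keeps the generator from wasting effort: illegal sequences are never extended, and sequences that only revisit a known state are dropped rather than added to the pool.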


The Captain


Heuristics is “coding every case”, edge or common, and everything in between? Yes?

Hundreds of thousands if not millions of lines of code? Yes?

Is that how LLMs, LMMs, and AGI, (and eventually ASI), work?

If so… Then that’s NOT what I’m envisioning.

I thought they are now based on neural nets.
And they are perhaps a really advanced AI “agent”.

Back to learning to drive.
A 16yo teen learns to drive by watching mom/dad, other drivers, takes the driver’s ed course, watches some videos, reads the text, and takes a test.

She has NOT seen every “edge” case, yet is given permission to drive.

Her neural net provides the flexibility to adapt her actual experience to the “new, edge case” experiences she encounters on her journey.

Isn’t this a power law thing?
She has 80% of the needed knowledge, and will acquire more as time goes by and she gets experience with more “edgy” cases?
The “law” says she has the basic knowledge required to be allowed to drive. So… She drives.

An ASI (neural net) using LMMs (Multimodal, neural nets) in my imagination can do as well.

Llama (Metas LLM

ralph. Yeah I’m thinking a little wider and farther out than most comments I read/ hear.


Yes. I mentioned heuristics to explain why simulations are not likely to cover lots of edge cases. I was not aware of “directed random simulation.” That might help but I have my doubts that it can get them all. In any case, after a time there will be enough real world data to make simulations unnecessary.

The 16yo teen has been watching lots of people driving for well over a decade, mom, dad, bus drivers, and lots of movies with cars.

Her neural net has the accumulated learning of having survived since birth, a huge advantage over FSD AI.

I don’t think so.

During her 16 years she has probably ridden a bicycle and other contraptions. Crossed roads. Seen how street lights and signs work. Whether she drives or not, she already has a vast store of useful learning. It’s not as if she were an empty driving bucket at 16. Long before 16 I could identify cars by their tail lights. :slightly_smiling_face:

The Captain


Ralph, Silicon Valley would never repackage things for sales purposes.

This must be the biggest breakthrough since sliced bread.


WRT running LLMs, (and LMMs), I keep hearing some $ amount per thousands or millions of “tokens”.

Tokens can refer to cryptocurrency units, but that didn’t seem to fit the LLM context.

In the LLM context here’s what I found:

{ When you are dealing with LLMs, you often come across the terms “vectors,” “tokens” and “embeddings.”

Snip ooooooooo

In mathematics and physics, a vector is an object that has both magnitude and direction.

Snip oooooooooo

In the realm of LLMs, vectors are used to represent text or data in a numerical form that the model can understand and process.

Snip oooooooo

Embeddings are high-dimensional vectors that capture the semantic meaning of words, sentences or even entire documents. }

[ralph - Embeddings are “vectors” which are data arrays in numeric form, and therefore can be processed by a neural net - LLM, LMM.]
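
A small sketch of what "capturing semantic meaning" looks like in practice: related words get vectors that point in similar directions, which can be measured with cosine similarity. The three-dimensional vectors below are invented for illustration; real embeddings come out of a trained model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not taken from any real model.
embeddings = {
    "car":    [0.9, 0.1, 0.0],
    "truck":  [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

print("car vs truck: ", cosine_similarity(embeddings["car"], embeddings["truck"]))
print("car vs banana:", cosine_similarity(embeddings["car"], embeddings["banana"]))
```

With these made-up numbers, "car" and "truck" score close to 1 while "car" and "banana" score close to 0, which is the kind of numeric relationship a neural net can work with.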

Snip ooooooo

{ Tokens, which we explore in the next section, are the mechanism to represent text in vectors.

Tokens are the basic units of data processed by LLMs. In the context of text, a token can be a word, part of a word (subword), or even a character — depending on the tokenization process. }


A token might be a single character in a text. The number of tokens would rapidly increase.
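
Here is a toy comparison of the three granularities mentioned in the quote, to show how the token count changes. The subword splitter uses a hand-picked vocabulary and a greedy longest-match rule, both assumptions for illustration; real LLM tokenizers learn their subword vocabulary from data.

```python
def subword_tokens(word, vocab):
    """Greedy longest-match split; falls back to single characters."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # Take the longest piece found in the vocabulary,
            # or a single character if nothing matches.
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


text = "Tokenization matters"
vocab = {"Token", "ization", "matter"}  # hypothetical subword vocabulary

word_tokens = text.split()              # one token per word
char_tokens = list(text)                # one token per character
sub_tokens = [t for w in word_tokens for t in subword_tokens(w, vocab)]

print(len(word_tokens), word_tokens)
print(len(sub_tokens), sub_tokens)
print(len(char_tokens), "character tokens")
```

The same two-word string becomes 2 word tokens, 4 subword tokens, or 20 character tokens, so the choice of tokenization directly changes how many tokens get processed.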

In the context of video … A token might be a frame?
For a frame/ photo/ picture it might be subunits, or objects within the frame?

WRT audio, what would it be?

Now the reference to “$10 per Million tokens” begins to make sense as the “cost” to run an LLM.
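
The arithmetic behind that kind of pricing is simple enough to sketch. The $10 per million figure is just the example rate quoted above, and the token counts are made-up inputs; actual providers set their own rates and often price input and output tokens differently.

```python
def token_cost(num_tokens, dollars_per_million=10.0):
    """Cost in dollars of processing num_tokens at a flat per-million rate."""
    return num_tokens / 1_000_000 * dollars_per_million


# Hypothetical usage: one short exchange, then a heavy month.
print(f"${token_cost(2_500):.3f}")        # a single 2,500-token exchange
print(f"${token_cost(50_000_000):.2f}")   # 50 million tokens in a month
```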

I “think” I have enough understanding of what are “vectors, embeddings, and tokens” so that now I’ll better understand the YouTuber discussions.


Edit to add some more acronyms I’ve heard:

LVMs Large Vision Models.

VLA Vision Language Action Models.
This, IMO, is an LMM.

LBMs Large Behavior M.

VLA and LBM tie the AI to some “action” that results from the AI interpretation of “sensory inputs”.


Here is my rudimentary understanding of Training vs Inferencing, as used in the discussions of LLMs, LMMs, FSD, AI Agents, ASI, etc.

Deep Learning Training vs. Inference: What’s the Difference?

{ The Machine Learning Life Cycle

Machine learning works in two main phases: training and inference.

In the training phase, a developer feeds their model a curated dataset so that it can “learn” everything it needs to about the type of data it will analyze.

Then, in the inference phase, the model can make predictions based on live data to produce actionable results. }

Training is the process by which a “blank slate” AI AGENT acquires the data (knowledge) it needs in order to perform some “intelligent” task. It’s not yet doing any real “work”. But it’s ready.
Training is compute intensive, and is likely done in a data center.

Inferencing is the process by which an AI AGENT detects the environment and COMPARES those sensory inputs to its “trained” data, and then OUTPUTS some real “work”. I.e., the Agent OWNER gets some real world benefit from the “trained agent”.

Inferencing occurs at the “edge” where the AI interacts with its environment. And uses “edge compute” capabilities.
IoT Agents.

When a user accesses an LLM (or LMM? or FSD?), she is getting an already trained model. She then inputs her “personal, novel” data/information, to the trained model. The LLM then compares the novel data to its stored training… And INFERS a “correct” response.
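
A minimal sketch of the two phases, using a one-variable linear model instead of a neural net so it fits in a few lines. The data, learning rate, and function names are assumptions for illustration. One nuance it makes visible: after training, the model keeps only its learned parameters (here `w` and `b`), and inference applies those parameters to new input rather than looking anything up in the training data.

```python
def train(data, epochs=2000, lr=0.05):
    """Training phase: fit y = w*x + b by gradient descent on squared error."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b   # the "trained model" is just these two numbers


def infer(params, x):
    """Inference phase: apply the frozen parameters to a new input."""
    w, b = params
    return w * x + b


# Hypothetical curated dataset: samples of y = 2x + 1.
training_data = [(0, 1), (1, 3), (2, 5), (3, 7)]

params = train(training_data)   # compute-heavy, done once
print(infer(params, 10))        # cheap, done per request; close to 21
```

Training loops over the whole dataset thousands of times, while inference is a single multiply and add. That asymmetry is the small-scale version of "train in the data center, run wherever the model fits."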

Meta’s Llama 3 LLM has a “small” 8 billion parameter model as well as a larger 70B parameter model.

{ This release features pretrained and instruction-fine-tuned language models with 8B and 70B parameters that can support a broad range of use cases.

Snip ooooooooo

Llama 3 is pretrained on over 15T tokens that were all collected from publicly available sources. }

More at the link.

I’ve “heard” YouTubers say that the smaller 8B parameter Llama 3 LLM is designed for DIY and small business users who don’t need (or want) to use the larger LLMs.

The larger 70B parameter model is designed for users with larger budgets, technical support, and use needs.

Those YouTubers also suggest that “use and fine tuning” of the LLM at the user level, with the “edge” data, becomes part of the training set, and improves the LLM’s output.

I sorta intuited these concepts. The “data” I’ve acquired allows me to infer a better understanding of the YouTube discussions of the tech.


Edit: I’m using @Leap1. I’m unable to reply to my own posts, so I’m replying to Leap.


More simply:

At the datacenter the neural network is learning
The inference chip is executing

The Captain is writing


It can occur at the edge. Sometimes it is still done in the data center. When I interact with ChatGPT that is all done in their data center. The inferring of the answer to my query does not happen on my laptop.

Where inferring is done depends entirely on the size of the trained model.
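
A rough back-of-the-envelope for why size matters: the memory needed just to hold a model’s weights is the parameter count times the bytes per parameter. The sketch below covers weights only and ignores everything else a running model needs (activations, caches, the operating system), so treat the numbers as lower bounds. The 8B and 70B sizes are the Llama 3 figures quoted earlier in the thread; the precision choices are assumptions.

```python
def weight_memory_gb(num_params, bits_per_param=16):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


for name, params in [("8B", 8e9), ("70B", 70e9)]:
    for bits in (16, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{name} model at {bits}-bit: about {gb:.0f} GB")
```

At 16 bits per parameter the 8B model needs about 16 GB for weights and the 70B model about 140 GB, which is why the smaller one is the candidate for a laptop or a car while the larger one tends to stay on data-center hardware.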


Thanks for the comments. :slightly_smiling_face:

IIRC, the YouTubers have said Meta Llama 3 8B parameter model can run “locally - not connected to the Internet”, as long as the chipset/computer can handle it.

Tesla FSD runs on the car system with Hardware 3 or 4.

This video is an example of Inference?

And uses the SUNO AI V3 “trained” on songs / music from many genres?

Where is this Inferencing occurring?

Wes Roth mentions being prompted to pay some credits.

TIA for helping me better understand this tech!

This is a sales scheme. The word ‘results’ is key. The improvement is not happening.

Closed and open systems, i.e., a roll of the dice versus the complexities of human speech or driving, are extremely different. The risk tools we use as human beings are totally different. Closed-system games of chance follow classical odds. Nothing in the open world of events and happenings (redundant perhaps) is anything like that. The machine cannot compensate.

The gap between the generic event and the specific happening is unbridgeable for the machine. Classes in computer code for objects do not bridge the problem.
