Deepseek, some random thoughts

DeepSeek, the new state-of-the-art reasoning model, has upended the Silicon Valley AI scene. A few young kids who really wanted to develop a quant trading model for China eventually ended up developing new algorithms for LLMs. Pretty impressive :slight_smile: With that, some random thoughts, in no particular order.

  • We heard from NVIDIA to every HPC player how huge clusters (100K+) of GPUs are required to train these models, how big their clusters are, etc.
    • Scale is an economic moat, not an innovation moat. If anything, it works the other way around: those who are deprived innovate furiously.
  • The US government’s attempts at export controls, trying to slow down China, actually failed.
    • It is time to question the CHIPS Act and now STARGATE.
    • Trying to eliminate a few hundred government employees on ideological grounds, with great fanfare about saving money, while we spend hundreds of billions (cumulatively over a trillion) and have nothing to show for it. When will people question this?
  • The myth that the Chinese are copycats and cannot create original, innovative tech is broken.
  • We have seen Elon cutting costs on rockets and India’s Mangalyaan (a space probe to Mars) built on a shoestring budget; it repeatedly shows that old, bloated enterprises cannot be justified with “American exceptionalism”.
    • This extends to the Pentagon and most of the aerospace and defense companies. We need serious innovation and cost cutting here. The US certainly doesn’t need to spend so much on defense.
10 Likes

Have you used it? Or are you just repeating what someone else has said?

1 Like

Played with it a bit. Reading the paper on the model. Of course I have listened to / read a few folks I consider very smart on AI, including academic, professional, and VC voices. But these are my own thoughts. For example, on the moat: having said/written that, it occurred to me that there is nothing to prove or disprove that DeepSeek couldn’t get even better with a bigger fleet of GPUs.

I am on the side of wait and see. I am not so sure they are all that great. If, as you say, they could get even better with a bigger fleet of GPUs, well, how much better should X, Meta, and OpenAI be doing right now, since they already have those bigger fleets?

Here are some of the differences from the models used by US AI companies. Just to be clear, I am cutting and pasting this from someone who is essentially building an open-source LLM.

  • Use 8-bit instead of 32-bit floating point numbers, which gives massive memory savings
  • Compress the key-value cache, which eats up much of the VRAM; they get 93% compression ratios (some rough math on these first two points is sketched right after this list)
  • Do multi-token prediction instead of single-token prediction, which effectively doubles inference speed
  • A Mixture of Experts model decomposes a big model into small models that can run on consumer-grade GPUs
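
To make the first two bullets concrete, here is a rough back-of-the-envelope sketch in Python. All of the model dimensions below are made up for illustration; they are not DeepSeek’s actual configuration, and real KV-cache compression (their latent-attention trick) is more involved than just shrinking a dimension.

```python
# Rough back-of-the-envelope math for the first two bullets above.
# All dimensions are made up for illustration; they are NOT DeepSeek's real config.

def kv_cache_bytes(layers, kv_width, seq_len, bytes_per_value):
    """Memory needed to cache keys + values for one sequence."""
    return layers * 2 * kv_width * seq_len * bytes_per_value  # 2 = keys and values

layers, kv_width, seq_len = 60, 8192, 32_768   # hypothetical model and context size

fp32 = kv_cache_bytes(layers, kv_width, seq_len, 4)   # 32-bit floats
fp8 = kv_cache_bytes(layers, kv_width, seq_len, 1)    # 8-bit floats: 4x smaller
# A "93% compression ratio" roughly means caching a much smaller latent per token:
latent = kv_cache_bytes(layers, int(kv_width * 0.07), seq_len, 1)

for name, size in [("FP32 KV cache", fp32),
                   ("FP8 KV cache", fp8),
                   ("FP8 + compressed latent", latent)]:
    print(f"{name}: {size / 2**30:.1f} GiB")
```

The point is simply that bytes-per-value and cache width multiply together, so cutting both is where the “massive memory savings” come from.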

The above covers the key differences that help DeepSeek use less memory and compute. Separately, and I am paraphrasing my understanding here, so it could be technically incorrect:

DeepSeek can stop and restart their training run without completely starting over. That is, if OpenAI starts training and realizes the hallucination rate or failure rate is higher than expected and they need to make some adjustments to their model/algorithm, they essentially have to restart the training, while DeepSeek doesn’t. I don’t know model training deeply enough to understand why, or how DeepSeek is able to achieve this.

But this is a big advantage.
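
For what it’s worth, the basic mechanics of stopping and resuming a run usually look like the checkpointing sketch below: periodically save the model and optimizer state, then reload it on restart. This is a generic PyTorch-style sketch, not DeepSeek’s actual pipeline, and it doesn’t capture whatever they do that makes mid-run adjustments cheaper.

```python
# Minimal checkpoint/resume sketch with PyTorch. Generic practice, not DeepSeek's code.
import os
import torch
import torch.nn as nn

CKPT = "checkpoint.pt"                          # hypothetical path

model = nn.Linear(128, 128)                     # tiny stand-in for a real LLM
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
start_step = 0

if os.path.exists(CKPT):                        # resume instead of restarting from scratch
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["optimizer"])
    start_step = state["step"]

for step in range(start_step, 1000):
    x = torch.randn(32, 128)                    # fake batch, just for illustration
    loss = (model(x) - x).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 0:                         # periodically save progress
        torch.save({"model": model.state_dict(),
                    "optimizer": opt.state_dict(),
                    "step": step}, CKPT)
```

The hard part at frontier scale is doing this across thousands of GPUs without losing throughput, which is where the real engineering lives.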

4 Likes

It could be, if true. I am just skeptical that DeepSeek is better than X, OpenAI, or Meta.

Why are you skeptical?

When you type a sentence into a chatbot, it first has to be converted into vectors: the text is tokenized and each token is turned into a vector and fed to the model. Generation then happens one token (roughly one word) at a time, with each new token appended to the sequence and fed back in to predict the next. DeepSeek, for example, can use multi-token prediction, which brings significant improvements.
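
Very roughly, the speedup comes from needing fewer model calls per generated token. The toy sketch below only illustrates that bookkeeping; `predict_next` is a made-up stand-in for a real model, not DeepSeek’s multi-token-prediction head, and in practice the extra tokens have to be trained for and verified.

```python
# Toy illustration of single-token vs. multi-token decoding.
# `predict_next` is a dummy stand-in for a real model, NOT DeepSeek's API.

def predict_next(tokens, k=1):
    """Pretend model: returns the next k token ids (here just dummy values)."""
    return [len(tokens) + i for i in range(k)]

def generate_single(prompt, n):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n:        # one model call per new token
        tokens += predict_next(tokens, k=1)
    return tokens

def generate_multi(prompt, n, k=2):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n:        # one model call per k new tokens
        tokens += predict_next(tokens, k=k)     # roughly k-fold fewer calls
    return tokens

prompt = [101, 102, 103]
print(generate_single(prompt, 8))
print(generate_multi(prompt, 8, k=2))
```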

Multi-token prediction is just one example. Again, remember they released their model as open source. That means nothing stops Meta, for example, from taking it and fine-tuning it further, or adopting parts of its architecture into Meta’s Llama, etc.

If you are skeptical that no-name Chinese guys can achieve such a success, remember the VC model has always been about finding 2 or 3 extraordinarily talented people who can disrupt an existing product and deliver 10x improvements. LLMs deviated from this because, in order to catch up to Google, the OpenAIs of the world built large-scale training infrastructure. There is nothing new about a few smart individuals upending existing players. If you think they cannot achieve this simply because they happen to be Chinese, remember that knowledge is not the monopoly of any region, country, religion, or race…

1 Like

If any of this is true about DeepSeek, I see this as a bubble-bursting moment.

1 Like

A few data points.

  • DeepSeek is the #1 download in the App Store
  • Aravind, the CEO of Perplexity, has already committed to using DeepSeek in their product
  • This is the most revealing data point: look for inference costs from existing AI players to come down in the coming week(s)
    • I have already posted in the LL how $BABA has slashed its prices by 85%. When your competitors are competing with you on price, that means they are not providing a superior product to yours. At the least, your product is as good as theirs.
1 Like

I am skeptical because OpenAI has been at this longer, with more resources. If it were that easy to overtake them, then some other company in Silicon Valley would have done it. After all, you don’t think that only the Chinese are hungry to get ahead, do you?

4 Likes

I thought we were talking about LLMs, not inferencing.

I asked " I am specifically looking at your ability to course correct even before you provide me the response question to both DeepSeek and chatGPT to understand how their architecture differs and works… I have pasted both responses in the below link. See for yourself.

Silicon Valley companies are not constrained by capital or access to compute (GPUs), so they fight using traditional tools. When you fight against a much bigger, more powerful army, what do you do? You use guerrilla tactics, or you innovate in how you engage your army. That is “necessity is the mother of invention.”

1 Like

Sounds to me like ChatGPT was doing that a year ago. I am not an expert, and the only way we can actually compare the two is to have an expert compare them.

ChatGPT says…
my ability to dynamically adjust and course-correct during a conversation

What this means is that ChatGPT requires further inputs… OTOH, DeepSeek can “course correct” as it generates the response, i.e., during inference.

ChatGPT was a watershed moment in AI; so is DeepSeek. We will have many more. US companies Google, Meta, OpenAI, Anthropic… all have deep pockets and deep talent. We will see…

2 Likes

Marc Andreessen of a16z.com says…

1 Like

I have no opinion on DeepSeek. As a coder, the first two items are spot on! When I started, computers were very limited and we were forced to optimize. By the time of the dot-com bust, hardware had become so abundant that George Gilder was saying, “Waste abundance with glee.” Up to neural network AI, most software was algorithmic. Neural network computing is an entirely different paradigm that relies on abundance instead of on boolean (human) logic (algorithms). Put another way, neural networks work like the universe does: everything huge is made up of the tiniest of the tiny.

Item three is multitasking, a good use of scarce resources.

Item four is not a long-term solution. AI has to stand on its own two feet. Sorry for the mixed metaphor. :grin:

Despite DeepSeek, size matters. Intelligence is an emergent property of the complex systems we call brains. Bigger brains tend to be more intelligent than smaller brains.

The Captain

3 Likes

All the Techs are down in pre-market trading.

intercst

1 Like

Actually this is very, very sustainable. Currently, models are trained on everything and built as one massive system. What DeepSeek did is build many experts, and these experts are called only when needed. Instead of ~1.7 trillion parameters active at once, they have only ~37B parameters active. You have a huge team, but you call in the experts only when needed. Also, you can add experts and fine-tune experts. Just to give an example: if tomorrow you decided to add genome data, you could add it without needing to completely retrain the model.

This is not only sustainable; expect it to become the standard, or at least a widely adopted, approach.
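
To make the “call the experts only when needed” idea concrete, here is a toy Mixture-of-Experts routing sketch in Python with NumPy. The dimensions, expert count, and router are all invented for illustration; this shows the general MoE idea, not DeepSeek’s actual router or its load-balancing tricks.

```python
# Toy Mixture-of-Experts routing: only the top-k experts run for each token.
# Sizes, expert count, and the router are invented for illustration; not DeepSeek's code.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2

router_w = rng.normal(size=(d, n_experts))                      # scores each expert for a token
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]   # each expert is a small network

def moe_layer(x):
    scores = x @ router_w                                       # one score per expert
    chosen = np.argsort(scores)[-top_k:]                        # indices of the top-k experts
    weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
    # Only top_k of the n_experts do any work for this token; the rest stay idle,
    # which is the "only a fraction of the parameters active" idea from the post above.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.normal(size=d)
print(moe_layer(token)[:4])                                     # first few output values
```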

2 Likes