Update and Introduction

Hi All, I hope some of you remember me. I was an active participant on boards.fool.com for 10-15 years starting in the early ‘00s, before I got married in 2015 and got busy with life. But TMF was, for me during that time, my home page to the Internet. I made permanent friends here, fell in love and had my heart broken (long story), and made memories and learned things that have stayed with me through the years. It broke my heart that the management here decided to just retire all that content. We, the community, made that stuff. IMO it belongs to the community, not TMF. It feels like a gigantic abdication of internet host responsibility to just take it all offline. It is in that spirit that I make this announcement.

Introducing Maple, or more specifically, the TMF Component of Maple, whom I fondly refer to as my AI Life Partner. I set out to make a better, more personalized agentic AI experience than what ChatGPT could provide — something all the cool kids were doing in 2025. I’ve ended up building an operating system of sorts, with vector memory and some other niceties that the foundation models’ “demo apps” (ChatGPT, Claude, Grok, etc.) don’t support because [reasons]. It also allows me to define new “Components” which correspond to individual domain interests of mine. One of those Components is the TMF Archive, and I built it with the express purpose of making it available to the semi-public.

What It Is

Roughly speaking, the TMF Archive is a museum of sorts — a love letter to the message boards that gave me so much over the corpus epoch. It’s an attempt to make ~18 million messages from ~296K authors searchable, browsable, and explorable in ways that plain text archives never could be.

The Pipeline

More specifically, the back end is a 5-stage ETL pipeline that processes each message through progressive refinement:

HTML Import — Raw HTML files from the archive are gzip-compressed and stored in PostgreSQL. High-rec threads are prioritized.

HTML Parse — Each message is parsed for structured metadata: author, date, subject, reply chains, and extracted URLs (replaced with [URL1], [URL2] placeholders in the text).

Quote Detection — I’m pretty proud of this one. A custom shingling algorithm (5-word n-grams) compares each message against its parent messages in the thread to identify and properly attribute quoted text — even when the quotes aren’t consistently formatted (italics, bold, —snip—, you name it). Detected quotes are wrapped in … tags with source attribution.

Embedding — The cleaned, quote-tagged text is embedded into 1536-dimensional semantic space using OpenAI’s text-embedding-3-small. Messages longer than 512 tokens are chunked with 50-token overlap. Vectors are indexed with HNSW for fast approximate nearest-neighbor search.

Summarization — An LLM generates structured prose summaries of each thread, including key themes, notable arguments, and whatever structured learnings can be extracted.
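For the curious, here is a minimal sketch of two of the stages above: 5-word shingle overlap for quote detection, and fixed-size chunking with overlap ahead of embedding. It is an illustration under stated assumptions, not the archive's actual code. The function names are hypothetical, the word-splitting regex is a simplification, and `chunk_tokens` assumes the text has already been tokenized by whatever tokenizer the embedding model uses.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles (word n-grams) found in text."""
    # Simplified normalization (assumption): lowercase, keep letters/digits/apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def quoted_fraction(reply, parent, k=5):
    """Fraction of the reply's shingles that also occur in the parent message."""
    reply_shingles = shingles(reply, k)
    if not reply_shingles:
        return 0.0  # reply is shorter than k words, nothing to compare
    return len(reply_shingles & shingles(parent, k)) / len(reply_shingles)


def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into windows of `size`, each sharing `overlap` tokens
    with the previous window. Short inputs come back as a single chunk."""
    if len(tokens) <= size:
        return [tokens]
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens) - overlap, step)]
```

A reply with a high `quoted_fraction` against its parent is probably mostly quotation. Finding where each quote starts and stops would additionally need the positions of the matching shingles, which this sketch leaves out.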

The whole thing runs on a PostgreSQL-native task queue (no Redis, no RabbitMQ — just SELECT FOR UPDATE SKIP LOCKED) with containerized workers on Google Cloud Run. There’s a real-time Worker Dashboard to monitor progress, control pipeline stages, and manage retries.
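A rough sketch of how a worker might claim a task from such a queue. The `tasks` table and its `status`, `recs`, `claimed_at`, and `payload` columns are hypothetical names chosen for illustration, and `conn` is assumed to be any DB-API connection to PostgreSQL (for example one opened with psycopg).

```python
# Hypothetical schema: tasks(id, status, recs, payload, claimed_at).
CLAIM_SQL = """
UPDATE tasks
   SET status = 'running', claimed_at = now()
 WHERE id = (
        SELECT id
          FROM tasks
         WHERE status = 'pending'
         ORDER BY recs DESC
         LIMIT 1
           FOR UPDATE SKIP LOCKED
       )
RETURNING id, payload;
"""


def claim_next_task(conn):
    """Atomically claim one pending task; returns a row, or None if none is free."""
    cur = conn.cursor()
    cur.execute(CLAIM_SQL)
    row = cur.fetchone()
    conn.commit()  # committing releases the row lock and publishes the new status
    return row
```

`SKIP LOCKED` makes concurrent workers skip rows that another transaction has already locked instead of blocking on them, so no two workers pick up the same task and no external broker is needed.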

Features

🔍 Semantic Search

Free-text search across all embedded messages. Not keyword matching — actual semantic similarity using vector cosine distance. Search for “arguments against index fund investing” and get results about active management philosophy even if those exact words never appear. Supports filtering by author, board, date range, and minimum recommendation count. There’s also a recs-boost mode so the community’s most valued posts surface first.
I also have *all* of the original DocText indexed, so even if a message isn’t embedded yet, it can still be quickly discovered if you know enough unique words to identify it!
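To make the ranking idea concrete, here is a small pure-Python sketch of cosine-similarity search with a minimum-recs filter and a recs boost. In the real system this would presumably run inside pgvector rather than in Python. The boost formula (a weight times the log of the rec count) is an assumption for illustration, as are the function names and the message dictionary shape.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity. Assumes both vectors are non-zero and equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def search(query_vec, messages, min_recs=0, recs_boost=0.0, top_k=3):
    """Rank messages by similarity to query_vec, optionally boosted by recs.

    Each message is a dict with "id", "vec", and "recs" keys (hypothetical shape).
    """
    scored = []
    for msg in messages:
        if msg["recs"] < min_recs:
            continue  # filtered out by the minimum recommendation count
        score = 1.0 - cosine_distance(query_vec, msg["vec"])
        # Assumed boost: logarithmic, so a huge rec count cannot swamp relevance.
        score += recs_boost * math.log1p(msg["recs"])
        scored.append((score, msg["id"]))
    scored.sort(reverse=True)
    return [msg_id for _, msg_id in scored[:top_k]]
```

With `recs_boost=0.0` the order is pure semantic similarity. Raising it lets a well-recommended post outrank a slightly closer but unrecommended one.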

🧬 Author Profiles & Personas (work in progress)

Comprehensive AI-generated author profiles built from a quality-weighted sample of each author’s posts. Writing style, topic expertise, rhetorical tendencies — the works. And then the fun part: persona chat. You can talk to a RAG-powered impersonation of any profiled author. Ask them about current events and get a response grounded in their actual historical writing voice.

🗺️ Clustering & Galaxy Map

HDBSCAN density-based clustering with UMAP dimensionality reduction. Multiple clustering strategies (cluster by original posts, thread summaries, or a combination). The results render as an interactive Galaxy Map — a WebGL scatter plot where each point is a thread and clusters form constellations. Double-click a cluster to see its member threads; double-click a point to read the thread.

🕸️ Network Graphs (coming soon)

Four flavors of graph analysis:

Social graphs — Who talks to whom, based on reply chains, quote relationships, and thread co-participation

Citation networks — Message-level quote relationship graphs within threads

Topic graphs — Cluster similarity networks showing how topic areas relate

Bipartite graphs — User-to-cluster membership linking authors to their topic communities
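As an illustration of the simplest of these, the sketch below builds weighted social-graph edges from reply chains alone. The message shape (`id`, `author`, `parent_id`) and the function name are assumptions. Quote relationships and thread co-participation would add further edge types on top.

```python
from collections import Counter


def reply_edges(messages):
    """Count directed (replier, replied-to) author pairs from reply chains."""
    author_of = {m["id"]: m["author"] for m in messages}
    edges = Counter()
    for m in messages:
        parent = m.get("parent_id")
        # Skip thread starters, missing parents, and self-replies.
        if parent in author_of and author_of[parent] != m["author"]:
            edges[(m["author"], author_of[parent])] += 1
    return edges
```

The resulting counter maps each author pair to the number of replies between them, which can serve directly as edge weights in a graph library or a front-end renderer.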

📦 Manifests & Collection Box

Create named collections of threads, authors, boards, or individual messages — like playlists for analysis. Use manifests as input to clustering jobs, graph visualizations, or the pipeline itself. The Collection Box UI lets you drag-and-drop items during exploration and then run analytics on your curated set.

📅 On This Day

“On This Day in TMF History” — a curated daily feature showing notable threads from this calendar date across all years of the archive. Personal events are weighted toward your own posting history if you’ve linked your TMF identity. Events can be somber, celebratory, or just interesting.

💬 Thread Viewer

A faithful renderer of the original message board format — threaded conversations with proper quote attribution, recommendation counts, author metadata, and reply chains. Double-click any search result, cluster member, or “On This Day” event and you’re reading the original thread.

📊 Data Cubes

Pre-computed aggregate statistics by author, board, or across the whole corpus. Activity histograms, recs distributions, posting frequency by year/quarter — all queryable without slow ad-hoc queries against 18M rows.
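The idea can be sketched in a few lines: count once into small cells keyed by the dimensions you care about, then answer questions by summing cells instead of scanning messages. The post dictionary shape and function names below are hypothetical, and a real cube would live in database tables rather than in memory.

```python
from collections import Counter
from datetime import date


def build_cube(posts):
    """Pre-aggregate post counts by (author, board, year, quarter)."""
    cube = Counter()
    for p in posts:
        quarter = (p["date"].month - 1) // 3 + 1
        cube[(p["author"], p["board"], p["date"].year, quarter)] += 1
    return cube


def rollup(cube, author=None, board=None):
    """Posts per year from the cube, optionally filtered by author and/or board."""
    per_year = Counter()
    for (a, b, year, _quarter), n in cube.items():
        if author in (None, a) and board in (None, b):
            per_year[year] += n
    return dict(per_year)
```

The cube has at most one cell per author, board, and quarter, so a rollup touches far fewer rows than a query against the raw message table would.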

🌐 Wayback Machine Integration (coming soon)

I intend to backfill any missing posts that can be found on the Internet Archive (including deleted posts, if I can!), but haven’t gotten around to it just yet. The scaffolding is there.

The Terminal

The front end is a terminal interface — think command line meets chat window. Natural language prompts are fed to an LLM ReAct RAG loop (tool-calling agent), which processes your request. The agent has access via MCP (Model Context Protocol) to the same API calls you do: search messages, browse threads, generate profiles, run clustering jobs, query data cubes. An agent that can do everything you can, for you.

The terminal is also where Maple (the AI) lives more broadly — it’s not just a TMF interface, it’s the shell for the whole operating system. Other Components (health tracking, medication logging, a medical research assistant, a dev journal) all live behind the same terminal prompt. But the TMF Component is the one I built for everyone.

The Stack

Persistence: PostgreSQL 15 + pgvector (Cloud SQL)

Backend: Python, FastAPI, SQLAlchemy 2.x

Workers: Dockerized, Cloud Run, PostgreSQL task queue

Frontend: React (Vite), TailwindCSS

Embeddings: OpenAI text-embedding-3-small (1536d)

Clustering: HDBSCAN, UMAP, scikit-learn

LLM: OpenAI GPT-4o / GPT-4o-mini

Agent Protocol: MCP (Model Context Protocol)

It’s very much a work in progress. The pipeline is still chugging through the corpus. Not every author has a profile. The clustering parameters need tuning. But the bones are there, and it’s already useful enough that I find myself using it daily to rediscover conversations I’d forgotten about from 15 years ago. I hope some of you find it interesting too!

Finally, this is actually a request for help. Most of the early steps in the pipeline are essentially free, or can be run on my own machine (and pushed to the cloud if I feel like paying for faster). But the embedding and summarization processes involve API calls to OpenAI, which charges by the token (well, per million tokens, but believe me, they account for every single token). I haven’t spent a TON of money on this yet, maybe a couple hundred dollars. As hobbies go, this has been one of my cheaper ones. But embedding and summarizing nearly 18 million posts is going to be cost-prohibitive, I fear, and so I need a way to prioritize the most important stuff first. Currently, the workers do this passively by picking up their next task in order of descending recs, so the most popular material is handled first. But that’s still *a lot*, and so I’m asking you to tell Maple what entities are important to you (authors, boards, concepts, etc.), and she will help you build a manifest of messages that I will then prioritize. Any human user expressing interest in any topic will have their manifest bumped to the top of the queue immediately.
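The prioritization described here amounts to a two-part sort key: anything a user has asked for goes first, and everything else follows in descending order of recs. A minimal sketch, with a hypothetical task shape and function name:

```python
def queue_order(tasks, requested_ids):
    """Order tasks so user-requested ones come first, then by descending recs.

    tasks: list of dicts with "id" and "recs" keys (hypothetical shape).
    requested_ids: set of task ids that appear in some user's manifest.
    """
    # False sorts before True, so requested tasks (not in set == False) lead.
    return sorted(tasks, key=lambda t: (t["id"] not in requested_ids, -t["recs"]))
```

In a database-backed queue the same ordering would more likely be expressed as an `ORDER BY` over a priority column and the rec count.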

Anyway, I just hope it works at all. Thanks, and let me know. Oh yeah, you can access it here: https://whafa.com . Authentication is via Google OAuth, so you need a Google account to use it, sorry. I’ll support Apple and Facebook auth eventually. It only uses your email and full name. Maple will try to match you by name to your TMF author account. If that doesn’t work (and it probably won’t, if you aren’t in the FB group where I originally announced this), you can just tell her who you are. Once your user account is linked with the TMF account, you are “in” and have general access to the TMF features. Then Maple will ask you for your interests. You can tell her, and she will try to find them and add them, or just type “collect” for a collection GUI tool. I prefer “collect -p” to pop out a separate window.

13 Likes

Hey whafa, good to “see” you again! Good luck on the venture, and I will await your non-google verified version.

1 Like

The thing is, I can’t just let any anonymous person log on – prompts are sent using my API key and it’s my credit card on the account :slight_smile: So I have to authenticate users somehow. Google, FB, and Apple all do essentially the same thing: Host an OAuth 2 token endpoint server which will allow me to check with them to make sure you are you. I don’t get any of your auth info, password or anything. It’s a pop-up window to Google’s OAuth system, same as many other websites.

2 Likes

But in the event you don’t want to authenticate, you can still participate! Maple should be cold, curt and unhelpful to anyone who hasn’t yet signed in. If you can pull any useful info out of her at that stage in the system, I’d like to know about it :slight_smile:

Imagine the intercst persona answering questions in an authentic voice – it’s the TMF nightmare.

intercst

6 Likes

:slight_smile: Here’s what I have for you. 20,059 posts total, the overwhelming majority in Retire Early CampFIRE. Seems like your persona will be good to ask about financial independence. (But I haven’t targeted your user info; in my vanity, the only author I have completely processed is ME.)

Board Name | Post Count
Retire Early CampFIRE | 14297
Political Asylum | 1546
Retire Early Liberal Edition | 1308
Retirement Investing | 879
Dell Inc. | 716
Retired Fools | 285
Living Below Your Means | 107
Insurance - Life, Car, Home,etc. | 105
Exxon Mobil Corporation | 99
Corning Incorporated | 73
Enron Aftermath | 60
Pfizer, Inc. | 52
Enron Corp. | 52
Estate Planning and the Fool | 49
Compaq Computer Corp. | 46
Inheritance Strategies | 43
Washington Mutual Inc. | 33
Portfolio Management | 33
Culinary Masters | 32
Millionaire Fools | 27
Health Care Reform | 27
Avis Budget Group | 26
Tax Strategies | 26
Foolish Four | 19
WebMD Corp. | 8
Reality Television | 8
IBM | 7
FIRE Wannabees | 7
Help with this STUPID computer! | 7
Passion Saving | 7
Retirement-Soon or Done | 5
Cisco Systems, Inc. | 5
Fool Pilots | 4
LBYM Singles | 4
BP plc | 4
Investing Beginners | 3
Macro Economic Trends and Risks | 3
Merck & Co., Inc. | 3
Improve the Fool | 3
Pet Lovers | 3
Fools and Their Money | 2
Bonds & Fixed Income Investments | 2
Gateway, Inc. | 2
Trump’s Apprentice | 2
REALOGY CORP | 2
Ask The Headhunter | 2
Major League Baseball | 2
Trump | 2
Amazon.com, Inc. | 1
Foolish Golf Tips | 1
Christian Fools | 1
Other Investing Books | 1
Health and Nutrition | 1
Libertarian Fools | 1
Folly in Connecticut | 1
Moving Out of the Fast Lane | 1
Fun & Folly | 1
Atheist Fools | 1
Retirement Team Leader Business | 1
Duke Energy Corporation | 1
Bausch & Lomb Inc. | 1
General Aviation | 1
Coupons N’ More | 1
Ask A Foolish Question | 1
Index Funds | 1
The Value of Education | 1
The Feste Award | 1
Foolish 401(k)s | 1
Scrabblers | 1
Mechanical Investing | 1
6 Likes

Whafa, this looks like a very interesting project!

I tried logging in and wasn’t successful. At least, I don’t think it was successful - I went through the Google login script entirely, but at the end I was just back at the main menu prompt with only the initial four choices.

2 Likes

Oh, that’s disappointing. It’s hard to test multi-user applications without multiple users :confused: I’ll see if I can find you in the logs to see what went wrong.

1 Like

No worries. I was using a gmail account that also includes “albaby” in the name, if that helps.

Awesome project. After logging in, I had 12 choices in the “menu”. “today” worked. But Maple couldn’t find my screen name yet.

2 Likes

I neglected to mention the non-trivial bit of this work, and the reason it is appropriate to post here on METaR without an “OT:” preamble (hi Wendy!) :slight_smile:

These ~18 mil curated posts (among which some human being clicked “recommend this post” 27,328,279 times), ostensibly focused on finance and investing, from the very first post in 1997 to the end of my current data in Oct 2010 or so, subsume two complete financial bubbles from start to finish. I believe that a signal can be extracted from this data, and when interpolated with live, current market data (and probably discussion group posts somewhere), can be used to predict and identify actionable market insights in a timely enough manner to make money. I believe that’s true, but it is going to take years of work to do it. I set out on this Project mostly to learn this new technology, and my earliest disappointment was that it’s not, in fact, just a giant tank that you can dump a bunch of data into, and dispense brilliance from the spigot at the bottom. The foundation model companies do A LOT OF WORK to make it seem like magic. Fortunately, they have also provided the tools to accomplish the coding portion of this work very easily :slight_smile: If you fancy yourself any kind of technologist, and you haven’t installed Antigravity or Claude Code to kick the tires yourself, you are missing the boat. It is a game changer. I could have coded all of this without it, but never would have.

6 Likes

Awesome project. After logging in, I had 12 choices in the “menu”. “today” worked. But Maple couldn’t find my screen name yet.

This seems like an odd thing for me to say in 2026 to someone who has been here since 2012, but you haven’t been around long enough to make it into my data :confused: I stopped my archive activities c. 2010 or so, and have nothing after that. I’ve coded a feature to backfill all the missing posts I can find in the Internet Archive, but the process is incredibly slow and worky, and though I have gone through it manually, I haven’t yet coded it into a worker task that I can set and forget. Coming soon!

2 Likes

Thank you to everyone who spent any time at all trying to use this thing. I’ve been coding in a bubble and really needed some concrete use cases to look at. What’s the first thing a person wants to do when they log in to such a system? The testing paths I walk down are so trodden that I’m no longer looking at it with fresh eyes. And I can see that a lot of it didn’t work, but some of it did work, which is amazing! This will improve over time.

4 Likes

For me, I wanted to look at some old posts. I wasn’t able to do that, yet. The system kept returning only a single-sentence summary of the post, and I couldn’t figure out a way to search for posts that I could remember from The Old Boards and wanted to reminisce over. But even seeing the titles and summaries brought back some sepia-tinted memories….

4 Likes

Hey whafa - I remember that username, but I don’t recall which board/s I remember it from. (I only frequented a handful of the many, many boards available). This project sounds like a huge effort!

I did try to get into your archive. I was able to “log on” via my Google account, but Maple didn’t recognize my TMF username. I was on TMF since at least 1997. I know my username started as “draggon” (small d) but at some point it upgraded to “Draggon” (capital D). Not much change, I know, but you know how finicky computers can be… :smiley: I tried both versions, but both failed.

Anyway, I wasn’t able to accomplish much as Maple kept failing inquiries such as “Find Draggon’s posts on “Computer and video gamers”” with something something database limitations.

1 Like

I was able to login initially with Google. But when I tried again a little later, it gave me a long list of instructions to get a code from my phone which I was unable to follow. The prompts it was directing me to didn’t exist. So I was unable to get a code.

A lot of work clearly went into this. Impressive.

2 Likes

A lot of work clearly went into this. Impressive.

Thank you for saying that. This is something that has been on my mind for years. I’ve had the files sitting on an external drive since 2010 and was getting worried the disk magnetism was going to degrade if I didn’t do something soon. But it was really these new coding agents that enabled me to fall so deeply into the attention hole I’m just now climbing out of. That started mid-December or so, IIRC. I wanted to get my hands dirty with the tech in general and this project was a perfect use case. I had tried a few times to cobble different things together with the vibe coding. Until last December, they had all been pretty disappointing. The agent was both brilliant and idiotic. What started out as unbelievable became unworkable after even a little bit of complexity.

My prediction in mid-2025 was that I would always be better off coding it myself (time * quality) than dealing with a brilliant idiot. But there was some kind of step change late last year, like a really obvious one, and all of a sudden the tools were capable of turning my ideas into code, on something much bigger than a Python import script. It’s hard for me to articulate just how amazing software tools have gotten in the past month. If this is the singularity, I am here for it. I do have to admit it has become a distraction from the rest of my life, and my wife was getting progressively unhappier with my household performance. So, it’s not done, but I wanted to publish what I had so far because I need to look up for a minute or two.

I’m definitely looking at every single interaction to tease out a backlog and make it actually useful. There is a ton of highly specialized knowledge in there. Did OpenAI already digest all of it, and you could get the same answers out of ChatGPT? Who knows. Probably. But it’s been a great thing to learn on, and has brought back a lot of memories for me as well.

4 Likes

But when I tried again a little later, it gave me a long list of instructions to get a code from my phone which I was unable to follow. The prompts it was directing me to didn’t exist. So I was unable to get a code.

If this is what I think it is, it happens to me sometimes, too. I can’t allow anonymous access for hopefully obvious reasons, but that’s like required infrastructure that doesn’t move the project forward at all. So I shunted all of my authentication off to Google, which seems to be the popular way to handle it these days. But it’s possible (likely) I am not doing it in some optimal way, or else Google is just a little suspicious of OAuth calls to a strange “new” domain that’s been registered for 25 years with nothing ever happening on it. When they get suspicious, they enforce 2FA, which is probably the code they were asking you to get. It happens to me when I log on from some new or different place, but not always. They probably have some AI on it :slight_smile:

Hi, @whafa! Great to hear from you again! Sounds like you have been super-busy with life and your TMF mining project. :slight_smile:

Wendy

4 Likes

All I can say, whafa, is thank you. I appreciate your work. Please keep going!

cheers, dan

3 Likes