Hi All, I hope some of you remember me. I was an active participant on boards.fool.com for 10-15 years starting in the early ‘00s, before I got married in 2015 and got busy with life. But TMF was, for me during that time, my home page to the Internet. I made permanent friends here, fell in love and had my heart broken (long story), and made memories and learned things that have stayed with me through the years. It broke my heart that the management here decided to just retire all that content. We, the community, made that stuff. IMO it belongs to the community, not TMF. It feels like a gigantic abdication of internet host responsibility to just take it all offline. It is in that spirit that I make this announcement.
Introducing Maple, or more specifically, the TMF Component of Maple, whom I fondly refer to as my AI Life Partner. I set out to make a better, more personalized agentic AI experience than what ChatGPT could provide — something all the cool kids were doing in 2025. I’ve ended up building an operating system of sorts, with vector memory and some other niceties that the foundation models’ “demo apps” (ChatGPT, Claude, Grok, etc.) don’t support because [reasons]. It also allows me to define new “Components” which correspond to individual domain interests of mine. One of those Components is the TMF Archive, and I built it with the express purpose of making it available to the semi-public.
What It Is
Roughly speaking, the TMF Archive is a museum of sorts — a love letter to the message boards that gave me so much over the years. It’s an attempt to make ~18 million messages from ~296K authors searchable, browsable, and explorable in ways that plain text archives never could be.
The Pipeline
More specifically, the back end is a 5-stage ETL pipeline that processes each message through progressive refinement:
HTML Import — Raw HTML files from the archive are gzip-compressed and stored in PostgreSQL. High-rec threads are prioritized.
HTML Parse — Each message is parsed for structured metadata: author, date, subject, reply chains, and extracted URLs (replaced with [URL1], [URL2] placeholders in the text).
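The URL-placeholder substitution in the parse stage can be sketched like this. The regex and exact placeholder format here are my illustration, not necessarily what the pipeline uses:

```python
import re

URL_RE = re.compile(r"https?://\S+")

def extract_urls(text):
    """Replace each URL with a numbered [URLn] placeholder.

    Returns the cleaned text plus the list of extracted URLs, so the
    originals can be stored alongside the message. Simplified sketch;
    the real parser's URL regex is likely more careful.
    """
    urls = []

    def repl(match):
        urls.append(match.group(0))
        return f"[URL{len(urls)}]"

    return URL_RE.sub(repl, text), urls
```

Keeping the URLs out of the text also helps the later embedding stage, since long URLs are mostly noise to an embedding model.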
Quote Detection — I’m pretty proud of this one. A custom shingling algorithm (5-word n-grams) compares each message against its parent messages in the thread to identify and properly attribute quoted text — even when the quotes aren’t consistently formatted (italics, bold, —snip—, you name it). Detected quotes are wrapped in attribution tags identifying the source message.
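The core idea of shingling is simple enough to show in a few lines. This toy version just scores the overlap between a message and its parent; the real algorithm locates and tags the specific quoted spans:

```python
def shingles(text, n=5):
    """Return the set of n-word shingles (word n-grams) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def quote_overlap(child, parent, n=5):
    """Fraction of the child's 5-word shingles that also appear in the parent.

    High overlap suggests the child quotes the parent, regardless of how
    the quote was formatted (italics, bold, snip markers, etc.).
    """
    c, p = shingles(child, n), shingles(parent, n)
    return len(c & p) / len(c) if c else 0.0
```

Because shingles ignore formatting entirely, this catches quotes whether they were italicized, bolded, or just pasted inline.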
Embedding — The cleaned, quote-tagged text is embedded into 1536-dimensional semantic space using OpenAI’s text-embedding-3-small. Messages longer than 512 tokens are chunked with 50-token overlap. Vectors are indexed with HNSW for fast approximate nearest-neighbor search.
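The chunking scheme for long messages looks roughly like this. I'm using a plain token list as a stand-in for the model's actual tokenizer:

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into windows of `size`, with `overlap` tokens
    shared between consecutive windows so context isn't lost at the
    boundaries. Sketch of the embedding prep step; the real pipeline
    tokenizes with the embedding model's own tokenizer.
    """
    if len(tokens) <= size:
        return [tokens]
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```

Each chunk then gets its own 1536-dimensional vector, and all of a message's chunks can match a query independently.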
Summarization — An LLM generates structured prose summaries of each thread, including key themes, notable arguments, and whatever structured learnings can be extracted.
The whole thing runs on a PostgreSQL-native task queue (no Redis, no RabbitMQ — just SELECT FOR UPDATE SKIP LOCKED) with containerized workers on Google Cloud Run. There’s a real-time Worker Dashboard to monitor progress, control pipeline stages, and manage retries.
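The heart of a Postgres-native queue is a single claim query. The table and column names below are my guesses at a plausible schema, not the actual one, but the `FOR UPDATE SKIP LOCKED` pattern is the standard PostgreSQL idiom:

```python
# Hypothetical schema: tasks(id, stage, status, recs, claimed_at).
# Competing workers each run this; SKIP LOCKED makes them silently
# pass over rows another worker has already claimed, so no task is
# handed out twice and no worker ever blocks waiting on a lock.
CLAIM_SQL = """
UPDATE tasks
   SET status = 'running', claimed_at = now()
 WHERE id = (
         SELECT id
           FROM tasks
          WHERE status = 'pending'
            AND stage = %(stage)s
          ORDER BY recs DESC      -- highest-rec work first
          LIMIT 1
          FOR UPDATE SKIP LOCKED
       )
 RETURNING id;
"""
```

With this pattern the database itself is the broker, which is why no Redis or RabbitMQ is needed.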
Features
Semantic Search
Free-text search across all embedded messages. Not keyword matching — actual semantic similarity using vector cosine distance. Search for “arguments against index fund investing” and get results about active management philosophy even if those exact words never appear. Supports filtering by author, board, date range, and minimum recommendation count. There’s also a recs-boost mode so the community’s most valued posts surface first.
I also have *all* of the original DocText indexed, so even if a message isn’t embedded yet, it can still be quickly discovered if you know enough unique words to identify it!
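For the curious, the two scoring ideas behind semantic search and recs-boost are easy to show in miniature. The boost weight below is a made-up illustration, not the archive's actual formula:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors; this is the
    'semantic similarity' that vector search ranks by."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recs_boost(similarity, recs, weight=0.1):
    """Blend similarity with log-scaled rec count so the community's
    most valued posts surface first. The weight is illustrative only."""
    return similarity + weight * math.log1p(recs)
```

The log scaling keeps a 500-rec post from completely drowning out a highly relevant 5-rec post.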
Author Profiles & Personas (work in progress)
Comprehensive AI-generated author profiles built from a quality-weighted sample of each author’s posts. Writing style, topic expertise, rhetorical tendencies — the works. And then the fun part: persona chat. You can talk to a RAG-powered impersonation of any profiled author. Ask them about current events and get a response grounded in their actual historical writing voice.
Clustering & Galaxy Map
HDBSCAN density-based clustering with UMAP dimensionality reduction. Multiple clustering strategies (cluster by original posts, thread summaries, or a combination). The results render as an interactive Galaxy Map — a WebGL scatter plot where each point is a thread and clusters form constellations. Double-click a cluster to see its member threads; double-click a point to read the thread.
Network Graphs (coming soon)
Four flavors of graph analysis:
Social graphs — Who talks to whom, based on reply chains, quote relationships, and thread co-participation
Citation networks — Message-level quote relationship graphs within threads
Topic graphs — Cluster similarity networks showing how topic areas relate
Bipartite graphs — User-to-cluster membership linking authors to their topic communities
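The social-graph flavor, at its simplest, is a weighted edge list derived from reply chains. A toy sketch, with a made-up input shape (message id mapped to author and parent id):

```python
from collections import Counter

def social_graph(messages):
    """Build a weighted who-replies-to-whom edge list from reply chains.

    `messages` maps message id -> (author, parent_id); the edge weight
    counts how often one author replied to another. Sketch only; the
    real graphs also fold in quote relationships and co-participation.
    """
    edges = Counter()
    for author, parent_id in messages.values():
        if parent_id is not None and parent_id in messages:
            parent_author = messages[parent_id][0]
            if parent_author != author:
                edges[(author, parent_author)] += 1
    return edges
```

The same edge-counting idea extends naturally to the citation and bipartite flavors by swapping what the nodes represent.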
Manifests & Collection Box
Create named collections of threads, authors, boards, or individual messages — like playlists for analysis. Use manifests as input to clustering jobs, graph visualizations, or the pipeline itself. The Collection Box UI lets you drag-and-drop items during exploration and then run analytics on your curated set.
On This Day
“On This Day in TMF History” — a curated daily feature showing notable threads from this calendar date across all years of the archive. Personal events are weighted toward your own posting history if you’ve linked your TMF identity. Events can be somber, celebratory, or just interesting.
Thread Viewer
A faithful renderer of the original message board format — threaded conversations with proper quote attribution, recommendation counts, author metadata, and reply chains. Double-click any search result, cluster member, or “On This Day” event and you’re reading the original thread.
Data Cubes
Pre-computed aggregate statistics by author, board, or across the whole corpus. Activity histograms, recs distributions, posting frequency by year/quarter — all queryable without slow ad-hoc queries against 18M rows.
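The cube idea is just pre-aggregation: pay the scan cost once, then answer lookups from the result. A toy in-memory version (the real cubes are materialized in PostgreSQL, and the input shape here is my invention):

```python
from collections import defaultdict

def build_author_year_cube(posts):
    """Pre-aggregate (author, year) -> post count and total recs so the
    UI never has to scan 18M raw rows at query time.

    `posts` is an iterable of (author, year, recs) tuples.
    """
    cube = defaultdict(lambda: {"posts": 0, "recs": 0})
    for author, year, recs in posts:
        cell = cube[(author, year)]
        cell["posts"] += 1
        cell["recs"] += recs
    return dict(cube)
```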
Wayback Machine Integration (coming soon)
I intend to backfill any missing posts that can be found on the Internet Archive (including deleted posts, if I can!), but haven’t gotten around to it just yet. The scaffolding is there.
The Terminal
The front end is a terminal interface — think command line meets chat window. Natural language prompts are fed to an LLM ReAct RAG loop (tool-calling agent), which processes your request. The agent has access via MCP (Model Context Protocol) to the same API calls you do: search messages, browse threads, generate profiles, run clustering jobs, query data cubes. An agent that can do everything you can, for you.
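A ReAct tool-calling loop, stripped to its skeleton, looks something like this. Everything here (the action tuple shape, the tool registry) is illustrative; the real agent dispatches through MCP to the actual API:

```python
def react_loop(ask_llm, tools, prompt, max_steps=5):
    """Minimal ReAct-style agent loop.

    `ask_llm(history)` returns either ("tool", name, kwargs) to request
    a tool call, or ("final", text) to answer. Tool results are appended
    to the history so the next LLM turn can reason over them.
    """
    history = [("user", prompt)]
    for _ in range(max_steps):
        action = ask_llm(history)
        if action[0] == "final":
            return action[1]
        _, name, kwargs = action
        result = tools[name](**kwargs)
        history.append(("tool", name, result))
    return "step limit reached"
```

The step cap matters in practice: it keeps a confused agent from burning tokens in an endless search-read-search cycle.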
The terminal is also where Maple (the AI) lives more broadly — it’s not just a TMF interface, it’s the shell for the whole operating system. Other Components (health tracking, medication logging, a medical research assistant, a dev journal) all live behind the same terminal prompt. But the TMF Component is the one I built for everyone.
The Stack
Persistence: PostgreSQL 15 + pgvector (Cloud SQL)
Backend: Python, FastAPI, SQLAlchemy 2.x
Workers: Dockerized, Cloud Run, PostgreSQL task queue
Frontend: React (Vite), TailwindCSS
Embeddings: OpenAI text-embedding-3-small (1536d)
Clustering: HDBSCAN, UMAP, scikit-learn
LLM: OpenAI GPT-4o / GPT-4o-mini
Agent Protocol: MCP (Model Context Protocol)
It’s very much a work in progress. The pipeline is still chugging through the corpus. Not every author has a profile. The clustering parameters need tuning. But the bones are there, and it’s already useful enough that I find myself using it daily to rediscover conversations I’d forgotten about from 15 years ago. I hope some of you find it interesting too!
Finally, this is actually a request for help. Most of the early pipeline stages are essentially free, or can run on my own machine (and be pushed to the cloud if I feel like paying for faster). But the embedding and summarization stages involve API calls to OpenAI, which charges by the token (well, per million tokens, but believe me, they account for every single token). I haven’t spent a ton of money on this yet – maybe a couple hundred dollars. As hobbies go, this has been one of my cheaper ones. But embedding and summarizing nearly 18 million posts would be cost-prohibitive, I fear, so I need a way to prioritize the most important material first. Currently, the workers do this passively by picking up tasks in order of descending recs, so the most popular material is handled first. But that’s still *a lot*, so I’m asking you to tell Maple which entities matter to you (authors, boards, concepts, etc.), and she will help you build a manifest of messages that I will then prioritize. Any human user expressing interest in a topic will have it bumped to the top of the queue immediately.
Anyway, I just hope it works at all. Thanks, and let me know. Oh yeah, you can access it here: https://whafa.com . Authentication is via Google OAuth, so you need a Google account to use it, sorry; I’ll support Apple and Facebook auth eventually. It only uses your email and full name. Maple will try to match you by name to your TMF author account. If that doesn’t work (and it probably won’t, unless you’re in the FB group where I originally announced this), you can just tell her who you are. Once your user account is linked to your TMF account, you’re “in” and have general access to the TMF features. Then Maple will ask about your interests: you can tell her and she’ll try to find and add them, or just type “collect” for a collection GUI tool. I prefer “collect -p” to pop out a separate window.