Suck it up (rather, suck it down)

I’m trying to suck down historical posts from the BRK board because no one at TMF has answered yet whether they will preserve old posts in an archive (if such a DIY archive is successful, then one could do other boards, excuse me, ‘categories’ and ‘tags’).

I’m using ‘sitesucker’ on a mac (but it’ll work for windows too), and so far it looks like about 5 gigs and still running. It’s apparently sucking relevant stuff as far as I can tell, but I’m surprised by the magnitude. You also have to be careful with these sucky things or else it’ll follow links and try to suck the whole internet.

If you google something like ‘how to download complete websites’ you’ll find various tools (including ‘wget’ for the Unix freaks, which I may try next).

But perhaps someone more adept than me could have a go and see if they get something reasonable?

3 Likes

Click through the search tool listed on MI FAQ. But nothing is going to be kept forever.

1 Like

OK, I think I sucked it all down, but it’s a mess of separate files.
However, this is a ‘proof of principle’ that we can save the old historical posts (assuming that TMF won’t, and there is no indication at this point that they will keep all those valuable discussions from vanishing into the ether).

RayVT, aren’t you an old Unix/bash hand? Or maybe there are others? I’m just stabbing in the dark.
I used:
wget -r -E -k -p -np -nc --random-wait https://discussion.fool.com/berkshire-hathaway-brk-a-101158.aspx
and it apparently pulled down all posts back to the beginning of time, but as I mentioned above, it creates a directory that is just a big mass of html files. If I open a random html file I do get all the info for that thread displayed as a web page (for whatever long-ago date that thread happened to be), i.e. it looks like that long-ago discussion was captured. But the files aren’t organized, or else I don’t know which file to open to make them all organized, e.g. opening index.html didn’t seem to work. Perhaps the master organization html file is there but I don’t know which one it is, or maybe it’s not there and wget needs to be invoked differently?
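In case it helps with the "big mass of html files" problem, here is a rough Python sketch that walks the download directory and writes a single page linking to every saved file, labeled by its <title>. It is only a sketch under assumptions: the function names and the output file name archive_index.html are made up for illustration, and it assumes the saved pages are UTF-8 HTML with a usable <title> tag, which I have not checked against the actual download.

```python
import html
import re
from pathlib import Path
from urllib.parse import quote

# Assumption: each saved page carries its thread name in a <title> tag.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(path):
    """Return the page's <title> text, or the file name if none is found."""
    text = path.read_text(encoding="utf-8", errors="replace")
    match = TITLE_RE.search(text)
    if not match:
        return path.name
    # Decode entities and collapse runs of whitespace.
    title = " ".join(html.unescape(match.group(1)).split())
    return title or path.name


def build_index(root, index_name="archive_index.html"):
    """Write one page under root that links to every downloaded HTML file."""
    root = Path(root)
    index_path = root / index_name  # hypothetical name, chosen for this sketch
    pages = sorted(
        p for p in root.rglob("*.htm*") if p.is_file() and p != index_path
    )
    rows = []
    for page in pages:
        # Percent-encode the relative path so odd characters in file names
        # (for example '?') still produce a working link.
        href = quote(page.relative_to(root).as_posix())
        label = html.escape(page_title(page))
        rows.append(f'<li><a href="{href}">{label}</a></li>')
    body = (
        "<html><body><h1>Archive index</h1><ul>\n"
        + "\n".join(rows)
        + "\n</ul></body></html>\n"
    )
    index_path.write_text(body, encoding="utf-8")
    return index_path
```

Calling build_index on the directory wget created and opening the resulting file in a browser should give a flat, clickable list of everything that was captured. It does not sort by date or group by thread; that would need knowledge of the page layout.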

Any wget experts out there?

4 Likes

Here’s a screen shot just to show that it is indeed capturing board info before the switch


It looks like the old board, and you can click around in it, just as if it was the old board.
But the only available navigation is to e.g. click ‘previous’, so that’s not a good way to navigate.

If we could point the old search engine to the directory where all these old posts are, then they would be easily navigable, e.g. if you wanted to find old discussions on, say, put selling, then you could find those.

I forget who implemented that search engine. If they’re reading this, or someone knows how to contact them, then I think with their assistance we may be able to have an archive of all old posts and search them, if the search engine is simply pointed at the directory containing all the downloaded posts.
IOW, all those years of valuable work and discussion won’t vaporize.
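Until a real search engine can be pointed at the directory, a crude stand-in is possible. The sketch below is not the datahelper.com engine and makes no claim about how that works; it is just a hypothetical brute-force phrase search in Python, assuming the downloaded pages are UTF-8 HTML. The function name search_archive is made up for illustration.

```python
import html
import re
from pathlib import Path

# Crude tag stripper; good enough for a first look, not a real HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def search_archive(root, phrase):
    """Return the downloaded HTML files whose text contains the phrase."""
    # Case-insensitive match with whitespace collapsed on both sides.
    needle = " ".join(phrase.lower().split())
    hits = []
    for page in sorted(Path(root).rglob("*.htm*")):
        if not page.is_file():
            continue
        raw = page.read_text(encoding="utf-8", errors="replace")
        text = " ".join(html.unescape(TAG_RE.sub(" ", raw)).lower().split())
        if needle in text:
            hits.append(page)
    return hits
```

For example, search_archive("path/to/download", "put selling") would return the saved pages that mention put selling, where the path is wherever wget put the files. It rereads every file on each search, so over several gigs it will be slow; a proper index would be the next step.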

7 Likes

How big was the mess? ;-] And could you limit the DL to the mechanical investing forum only?
Thanks Ted
Brian

About 6 gig.
I started with the Berkshire board as root. I haven’t wandered around much in the 6 gig, but as shown in the screen shot, (at least some of) the original posts are there.

It sounded like Mark Willcox, who wrote the datahelper.com search capability, might do something if he gets time:

From: “Mark Willcox” willcox@datahelper.com
To: tiles-66505@mypacks.net
Subject: Re: archive of old board posts
Date: Oct 9, 2022 2:27 PM

Hi, Ted

Yes, I added a search capability some 20 years ago.

http://www.datahelper.com/mi/search.phtml

Up to a couple days ago, I did not show the actual post, only linked back into TMF. I have started opening it up a bit and will do more when they abandon their history.

I don’t use the boards any more but I’ve heard that message volume is way down. Very sad to see. It wouldn’t be all that hard to set up an alternative message board of some sort but someone would have to take that bull by the horns and run with it.


4 Likes

Thank you Ted. It sure is nice of Mark to do that.

3 Likes