That isn’t where the problem lies. The issue is memory bandwidth, or more to the point, how to manage caching so that you don’t throw away more performance than you gain. Let me put up a strawman, then show how you can improve it. You have two chips, both of which read the same data from memory. For balance, each chip gets alternate priority in requesting cache lines; if a chip skips its turn, the other chip can put in a request. The timing on this is easy. Reading a cache line will probably take, say, two memory bus clock cycles, which is dozens of chip clocks, and even two clocks is optimistic with a 256 or 512 bit wide memory bus (from graphics RAM). Remember that the requested data will take multiple clocks to return, since you can’t stall the bus waiting for it. It will come back when it is ready: quickly if from the CPU caches, slower if from main memory. So the incoming data needs to be tagged with its address.
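To make that concrete, here is a minimal C sketch of one way the alternate-priority request slot and the address tagging could look. The struct names, the 64-byte line size, and the arbiter interface are my own assumptions for illustration, not any real bus protocol.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64

/* A cache-line request issued by one of the two chips. */
typedef struct {
    uint8_t  chip;   /* 0 or 1: which chip wants the line      */
    uint64_t addr;   /* cache-line-aligned physical address    */
    bool     valid;  /* does this chip have a pending request? */
} LineRequest;

/* Data coming back on the bus. Responses return out of order
 * (fast from CPU caches, slow from main memory), so each one
 * carries the address it answers and the chips match it up.   */
typedef struct {
    uint64_t addr;
    uint8_t  data[CACHE_LINE_BYTES];
} LineReply;

/* Alternate-priority arbiter: 'turn' flips every slot. The chip
 * whose turn it is goes first; if it has nothing pending, the
 * other chip may use the slot instead. Returns the index of the
 * winning request, or -1 if neither chip has one.               */
static int arbitrate(const LineRequest req[2], uint8_t *turn)
{
    int first  = *turn;
    int second = 1 - *turn;
    *turn = (uint8_t)second;               /* priority alternates each slot */

    if (req[first].valid)  return first;
    if (req[second].valid) return second;  /* other chip takes the skipped turn */
    return -1;
}
```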
Anyway, we now have a merged stream of requests out and a merged stream of data in. Is it better to have data go only to the requesting chip, or to have both chips accept it? Probably the answer is to tag the data with a thread or process ID; if both chips are running the same process, the data goes to both. Think of a game running alongside an operating system display or a desktop: the game will run on both chips, the other processes (the ones that have something on screen) on just one chip.
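A rough sketch of that routing decision, assuming each chip keeps track of the process it is currently running (the bookkeeping and names here are hypothetical, not from any real interconnect):

```c
#include <stdbool.h>
#include <stdint.h>

/* What each chip is currently running (hypothetical bookkeeping). */
typedef struct {
    uint32_t pid;   /* process id this chip is executing for */
} ChipState;

/* A returned line now carries the requester's process id
 * in addition to its address.                              */
typedef struct {
    uint64_t addr;
    uint32_t pid;
} TaggedReply;

/* Decide which chips latch the incoming line. A process running
 * on both chips (the game) gets the data on both; anything else
 * (the desktop, the OS display) lands only on the chip that owns
 * that process.                                                   */
static void route_reply(const TaggedReply *r, const ChipState chip[2],
                        bool accept[2])
{
    accept[0] = (chip[0].pid == r->pid);
    accept[1] = (chip[1].pid == r->pid);
}
```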
If all the data is read-only, we are done. A lot of game data, like textures, will be read-only. But we need to be able to write, sometimes to main memory, but most importantly to one or more display buffers. (Yes, triple buffering is a thing, and if your refresh rate is high enough it is very nice: no rips or tears, etc.) A win here is to treat display data as write-only. You can put the screen buffer on one chip, but that means a lot of data streaming from one chip to the other. Or you can get fancy and have each chip responsible for half the screen. Not always a win, but splitting the screen into tiles is something nVidia adopted fairly recently. If you do that, you can keep the screen buffers in main memory. (Or in what the chips think of as main memory; the data will go to the CPU and get stored wherever the memory management puts it.)
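For a sense of the two ownership schemes, here is a small C sketch mapping a pixel to the chip that owns it, either by screen halves or by checkerboarded tiles. The tile size and the checkerboard assignment are illustrative assumptions on my part, not how nVidia actually splits work.

```c
#include <stdint.h>

#define TILE_W 16   /* tile size in pixels; purely illustrative */
#define TILE_H 16

/* Simplest split: top half of the screen belongs to chip 0, bottom
 * half to chip 1, so each chip writes only its own region of the
 * display buffer and nothing has to stream between the chips.      */
static int chip_for_pixel_halves(uint32_t y, uint32_t screen_h)
{
    return (y < screen_h / 2) ? 0 : 1;
}

/* Tiled split: carve the screen into TILE_W x TILE_H tiles and
 * checkerboard them across the two chips, which balances the load
 * better when one half of the frame is more expensive to draw.     */
static int chip_for_pixel_tiled(uint32_t x, uint32_t y)
{
    uint32_t tx = x / TILE_W;
    uint32_t ty = y / TILE_H;
    return (int)((tx + ty) & 1u);
}
```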
I’m not going to add up what requires extra silicon and I/O pins/connections. The extra silicon required isn’t too bad; it is those nasty I/Os. You will need a hundred or more pins just for the connection between the two chips. It should be possible to put in a switch to turn the extra stuff off, but it is probably easier to use wired logic (instead of transistors) to turn the extra bits off. That may take a few pins if you want to do it off chip, or some laser cuts to make a single-chip variant. That is probably the way to go, since other laser cuts will already be used to scavenge defective chips as lower-performance versions.