^^^ How long did that take to write?
So essentially what you are saying is that while it is possible, it'll never truly be exact? In other words, one side of the screen (in a half-by-half render split between the GPUs) will be behind the other half because one GPU finished before the other? And overall complex but great gains (like high risk (complexity), high reward)?
Well, I type fast. I had to edit a couple of typos, but it's a stream of consciousness post.
Basically you have it. Still, I may have left some confusion about the two approaches.
In the approach where we can "stack" memory, the division is not by the display (2D); it's by scene content (think front vs. back, or walls vs. characters - not the display's left/right, at least not strictly).
In the version where we divide the display (2D), we can't "stack" RAM. Each card still holds the same scene content, but each GPU renders a different portion of the 2D output. This latter version has a much greater likelihood of good synchronization between the GPUs.
I think you've explained this in a rather odd way; you seem to repeat yourself several times?
I agree with what you say about the RAM - the two cards' memories are in different physical places, so you may need to duplicate data. But are we talking about forward rendering only here?
Well, it was a stream of consciousness post, with an "intro/re-examine" layout.
This isn't limited to forward rendering; it can be any mixed model. The choice of forward or not is really up to the GPU resources and the complexity of lighting/materials, and most engines mix whatever is required for the effects involved. Although, your query does recognize there are difficulties.
For others interested, forward rendering is the older style; the 3D pipeline is usually fixed when forward rendering is used exclusively. In deferred rendering, sometimes called deferred shading (and this is getting more complex as the hardware improves), the pipeline is programmable, and the stage where fragments are processed (where each pixel is evaluated) is separated out - deferred - to a later pass than in the forward rendering model.
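For the curious, here's a toy sketch of the difference in plain Python standing in for the GPU stages. Nothing here is a real graphics API; the fragment/light structures are invented just to show where the shading work happens:

```python
# Toy sketch of forward vs. deferred shading, using plain Python dictionaries
# in place of real GPU buffers. All names are illustrative, not any API.

def shade(albedo, normal, light_dir):
    # Trivial N-dot-L lighting so the example stays tiny.
    n_dot_l = max(0.0, sum(n * l for n, l in zip(normal, light_dir)))
    return tuple(c * n_dot_l for c in albedo)

def forward_render(fragments, light_dir):
    # Forward: lighting is computed immediately, per fragment, as geometry
    # is rasterized - overdrawn pixels get shaded more than once.
    framebuffer = {}
    for frag in fragments:
        color = shade(frag["albedo"], frag["normal"], light_dir)
        key = frag["xy"]
        if key not in framebuffer or frag["depth"] < framebuffer[key][0]:
            framebuffer[key] = (frag["depth"], color)
    return framebuffer

def deferred_render(fragments, light_dir):
    # Deferred: the first pass only records surface attributes (a "G-buffer");
    # the expensive shading is deferred to a later, once-per-pixel pass.
    gbuffer = {}
    for frag in fragments:
        key = frag["xy"]
        if key not in gbuffer or frag["depth"] < gbuffer[key]["depth"]:
            gbuffer[key] = frag
    return {xy: (f["depth"], shade(f["albedo"], f["normal"], light_dir))
            for xy, f in gbuffer.items()}

fragments = [
    {"xy": (0, 0), "depth": 0.5, "albedo": (1, 0, 0), "normal": (0, 0, 1)},
    {"xy": (0, 0), "depth": 0.2, "albedo": (0, 1, 0), "normal": (0, 0, 1)},
]
light = (0, 0, 1)
# Same final image either way; the difference is when the shading work happens.
assert forward_render(fragments, light) == deferred_render(fragments, light)
```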
This causes some technical problems with output buffering, hence SPS's inquiry about limiting this to forward rendering. When complex lighting, special effects and transparencies are involved, the content of the output buffer must be read (so it can be attenuated by color contributions from light/shadow/transparency), then written back. When the two GPUs are processing different 3D geometry, this doesn't seem possible without some way to synchronize an output buffer across two cards with two distinct memory spaces.
This is yet another reason that tiled or banded output, dividing the 2D space of the display, not the scene content, is really more performance oriented, while "stacking" RAM would be less about performance and more about extending 3D scene content. In reality, I don't think 3D RAM stacking will be a popular paradigm. It may be faster to market, and more equitable, to simply wait for larger RAM cards.
Yet, there are hardware solutions. They are locked behind opaque doors, and we can only glean information about what MIGHT be a particular design approach based on the research on the subject. It appears one of a few solutions is probably in the works at the hardware level, but that means new cards would be required to support the solution.
I don't expect existing cards will be able to fully deploy these new concepts. The style where 2D display output is tiled or banded is more likely supportable in older cards than RAM "stacking".
This is central to SPS's inquiry. Existing cards might be limited to forward-only rendering, which can hinder throughput, if they were to employ RAM "stacking". And to support tiled or banded output, the method where hardware doesn't solve the problem of resolving tiles into a single display buffer could mean software-driven copying and bus traffic at each frame.
GPUs follow a client/server model. The CPU is the client, the GPU is the server. At each frame, the CPU decides what is on display and drives the 3D pipeline with commands. These commands, and any data which must be supplied or altered to perform them, are sent over the PCIe bus. It's a bottleneck compared to RAM performance within the GPU. Basically, 3D engines regard the GPU as a separate computer. In the original design concepts (back in the '80s and '90s), the GPU could even be in a physically separate cabinet, and the bus might even be a network or other bus connection (like SCSI for display cards).
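If it helps, here's a toy Python sketch of that client/server split. The FakeGPU and its "opcodes" are made up - no real driver looks like this - it's only meant to show why per-frame bus traffic is the thing to minimize:

```python
# Toy sketch of the client/server split: the CPU (client) records commands
# and "submits" them across the bus; the GPU (server) executes them against
# its own, much faster local memory. Everything here is illustrative.

class FakeGPU:
    def __init__(self):
        self.local_ram = {}              # fast, on-card memory

    def execute(self, command_buffer):
        for op, payload in command_buffer:
            if op == "upload":           # data crossed the bus once, then stays
                self.local_ram.update(payload)
            elif op == "draw":           # heavy work touches only local RAM
                print("drawing", payload, "using", sorted(self.local_ram))
            elif op == "present":
                print("frame presented")

gpu = FakeGPU()

# CPU side: uploads are kept out of the per-frame loop, because crossing the
# bus is the bottleneck compared to the GPU's own RAM.
gpu.execute([("upload", {"mesh_walls": "vertex data..."})])
for frame in range(2):
    gpu.execute([("draw", "mesh_walls"), ("present", None)])
```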
The output buffer is just a block of RAM, and within a single GPU there is fast access to this buffer. When the output buffer is divided into tiles, each GPU can render its designated tiles, leaving empty spaces, like a haphazard chessboard. When all pixels have been rendered for all tiles, the tiles can be accumulated into a completed image for display, but this implies part of the content must be transmitted between cards.
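A minimal sketch of that accumulation step, assuming a checkerboard tile assignment and that each GPU hands back only the tiles it owns (all sizes and names invented for illustration):

```python
# Gathering tiles from two GPUs into one frame. Each "GPU output" is a dict of
# tile_index -> pixel rows; in reality the missing tiles would have to be
# copied across the bus, which is the cost discussed above.

TILE, WIDTH, HEIGHT = 4, 8, 8            # 8x8 frame, 4x4 tiles

def owner(tile_x, tile_y):
    # Checkerboard assignment: which GPU renders which tile.
    return (tile_x + tile_y) % 2

def gather(frame, outputs):
    # Copy each tile from whichever GPU rendered it into the final image.
    for ty in range(HEIGHT // TILE):
        for tx in range(WIDTH // TILE):
            tile = outputs[owner(tx, ty)][(tx, ty)]
            for row in range(TILE):
                y = ty * TILE + row
                frame[y][tx * TILE:(tx + 1) * TILE] = tile[row]
    return frame

# Fake per-GPU results: GPU 0 fills its tiles with 0s, GPU 1 with 1s.
outputs = {g: {(tx, ty): [[g] * TILE for _ in range(TILE)]
               for ty in range(HEIGHT // TILE)
               for tx in range(WIDTH // TILE) if owner(tx, ty) == g}
           for g in (0, 1)}
frame = [[None] * WIDTH for _ in range(HEIGHT)]
for row in gather(frame, outputs):
    print(row)
```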
There is an out. I don't know which cards may deploy this solution, but it's in the research (white papers by PhD candidates and scientists). The display circuitry, which provides color pixel by pixel to the display, can be masked, such that through the connectors on the top of the cards (that is, a private bus - not the PCIe bus) the relevant portions of the tiled or banded output can be selected from whichever card processed that tile or band. This is not applicable to the "stacked" RAM approach.
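To picture that masking idea, here's another tiny sketch (again invented, not any vendor's actual circuit): instead of copying tiles between cards, the scan-out stage simply switches which card's buffer it reads from, pixel by pixel:

```python
# Scan-out masking: nothing is copied between framebuffers; the display
# selector just picks, per pixel, which card's local buffer to read from,
# based on which card owns that tile. Entirely illustrative.

TILE = 4

def owner(x, y):
    # Same checkerboard assignment as before, computed from pixel coordinates.
    return ((x // TILE) + (y // TILE)) % 2

def scan_line(y, buffers, width):
    # buffers[g][y][x] is GPU g's local framebuffer.
    return [buffers[owner(x, y)][y][x] for x in range(width)]

buffers = {g: [[g] * 8 for _ in range(8)] for g in (0, 1)}
print(scan_line(0, buffers, 8))   # [0, 0, 0, 0, 1, 1, 1, 1]
```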
In the tiled or banded approach, each tile represents only a portion of the display, but contains a complete rendering of all 3D scene content. In the "stacked" RAM approach, whatever the output buffer contains (tiled or not...the two concepts could be combined), it does not necessarily hold the complete 3D scene content. Some of the 3D content may have been rendered by the opposing GPU. This leaves a resolution problem.
Let's say each card processed the entire display, divided by 3D scene content so the RAM is "stacked". Card A might have an object on display that appears to be by itself. Card B has an object on display which is actually in front of the object shown in Card A's output. Pixel by pixel, a determination must be made as to which object is in front; that determines the color shown for each pixel involved. To solve this, a Z-buffer might be used. That is, for each pixel, the color buffer holds the color of the single object closest to the camera, and the Z-buffer holds that object's distance from the camera (its Z coordinate) at that pixel. Compare the two Z coordinates, and you can decide which object is in view and which one is obscured (and this isn't by object, it's by pixel).
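A quick sketch of that per-pixel resolve, assuming both cards hand back a full-screen color buffer plus a Z-buffer (plain Python lists, purely for illustration):

```python
# Per-pixel resolve of two full-screen outputs when the scene content was
# split between the cards ("stacked" RAM). Whichever card saw the closer
# surface at a pixel wins. Buffers are flat lists for simplicity.

FAR = float("inf")   # depth value meaning "nothing was drawn here"

def resolve(color_a, depth_a, color_b, depth_b):
    out_color, out_depth = [], []
    for ca, za, cb, zb in zip(color_a, depth_a, color_b, depth_b):
        if za <= zb:                  # card A's surface is closer (or B is empty)
            out_color.append(ca); out_depth.append(za)
        else:                         # card B's surface is in front
            out_color.append(cb); out_depth.append(zb)
    return out_color, out_depth

# Card A drew a red object at depth 5 in pixel 0, nothing in pixel 1.
# Card B drew a blue object at depth 2 in both pixels.
color, depth = resolve(["red", None], [5.0, FAR],
                       ["blue", "blue"], [2.0, 2.0])
print(color, depth)   # ['blue', 'blue'] [2.0, 2.0] - B's object occludes A's
```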
This doesn't account for transparency. Transparency is processed as a second pass. Some GPU designs require geometry to be sorted from back to front, while others don't have a preference. Yet, all of them require that opaque objects are processed first, so that transparencies (and other special effects) can be calculated by incorporating the color of the object(s) behind the transparency (or effect). This becomes highly complicated if the GPUs are processing different 3D scene content.
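Here's what that second pass looks like for a single pixel, as a sketch (the colors and alpha values are invented). Note the read-modify-write on the destination color - that's exactly what gets awkward when the color behind the transparency lives on the other card:

```python
# The transparency pass: opaque results are already in the buffer, then
# transparent surfaces are blended back to front over them.

def blend(dst, src, alpha):
    # Classic "over" blend: src drawn over dst with the given opacity.
    return tuple(alpha * s + (1.0 - alpha) * d for s, d in zip(src, dst))

# Start from the opaque result for one pixel.
pixel = (0.0, 0.0, 1.0)                       # opaque blue wall

# Transparent surfaces covering that pixel: (color, alpha, depth).
transparent = [((1.0, 0.0, 0.0), 0.5, 3.0),   # red glass, farther away
               ((0.0, 1.0, 0.0), 0.25, 1.0)]  # green smoke, nearest
transparent.sort(key=lambda t: -t[2])         # farthest first

for color, alpha, _depth in transparent:
    pixel = blend(pixel, color, alpha)        # must READ the buffer, then write

print(pixel)   # the final color depends on every layer behind the pixel
```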
As SPS's inquiry implies, a simple (read: oversimplified) solution would be to require forward rendering. It doesn't resolve all the issues, but it takes a good step forward.
However, the real problem comes down to which color to use for a transparency or effect calculation. If Card A, from above, were processing a transparency, but didn't have the color from Card B for an object that's actually in front of the forward-most object Card A "knew" about, the transparency calculation would be incorrect.
This implies that the output buffer must be resolved to a single output before a transparency phase can be executed. There are numerous research articles on the topic, but I can only guess which will be employed.
Instead of thinking of the output buffer as a traditional 2D image to be put on display, an intermediate output buffer of higher complexity is proposed, with multiple "panes" or "layers"...like Photoshop layers. Transparency or special effect processing requires a geometry calculation (a vertex phase in the pipeline), then a fragment calculation (the pixel by pixel processing) to contribute to an output color. If this is on a separate layer, with all the required information, a new type of deferred processing is implied: one where the output of each GPU, covering only partial 3D scene content, is shipped to the adjacent card(s) for resolution by a fast process of limited demand. These layers are "flattened" by a simplified post-processing phase, resolving all issues with respect to depth and transparency.
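To make the "layers" idea concrete, here is my rough guess at what such an intermediate buffer could hold per pixel: a small list of (depth, color, alpha) entries from each card, merged and blended back to front. This is speculation on my part, not any vendor's actual format:

```python
# Flattening a layered intermediate buffer for one pixel: merge both cards'
# layer lists, sort farthest-first, and blend down to a single color.

def flatten(pixel_layers_a, pixel_layers_b, background=(0.0, 0.0, 0.0)):
    layers = sorted(pixel_layers_a + pixel_layers_b, key=lambda l: -l[0])
    color = background
    for _depth, layer_color, alpha in layers:
        color = tuple(alpha * s + (1.0 - alpha) * d
                      for s, d in zip(layer_color, color))
    return color

# Card A saw an opaque wall; card B saw a pane of glass in front of it.
card_a = [(10.0, (0.2, 0.2, 0.2), 1.0)]       # opaque -> alpha 1.0
card_b = [(4.0, (0.0, 0.4, 0.8), 0.5)]        # transparency in front
print(flatten(card_a, card_b))
```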
There is another approach, though. This can be transported via the external bus (that connector on top of the cards), where the output calculation is programmed via a new set of instructions. Basically, as each pixel is being sent to the display, it's evaluated, in real time, with respect to the depth/transparency calculations. This is quite possible, but requires hardware support. The traditional output of a finished 2D image is traded for a more complex buffer of instructions per pixel, instead of visible pixels. As each pixel is considered for output, these instructions act like tightly encoded machine-language instructions which give the output processor the ability to resolve the correct color, even though the data from each card is basically incomplete.
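Again, purely my own illustration of the "instructions instead of pixels" idea. The opcodes here are invented, but they show the shape of what the output processor would have to do as each pixel is scanned out:

```python
# Each card emits a tiny "program" per pixel instead of a finished color; the
# output stage interprets them together at scan-out, resolving depth and
# transparency across cards on the fly. The opcode set is invented.

def scan_out(programs_a, programs_b):
    depth, color = float("inf"), (0.0, 0.0, 0.0)
    blends = []
    for op, args in programs_a + programs_b:
        if op == "OPAQUE":                    # args: (depth, color)
            if args[0] < depth:
                depth, color = args
        elif op == "BLEND":                   # args: (depth, color, alpha)
            blends.append(args)
    for d, c, a in sorted(blends, key=lambda b: -b[0]):
        if d < depth:                         # only blend what isn't occluded
            color = tuple(a * s + (1.0 - a) * x for s, x in zip(c, color))
    return color

card_a = [("OPAQUE", (10.0, (0.2, 0.2, 0.2)))]
card_b = [("BLEND", (4.0, (0.0, 0.4, 0.8), 0.5))]
print(scan_out(card_a, card_b))
```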
There are other approaches, hybrids...and of course I can't tell which one(s) each company may prefer until product is in hand (and I've already been called out for going on too much).