DirectX 12 Will Allow Multi-GPU Between GeForce And Radeon Configs?

WYP

News Guru
Most PC gamers are excited about DX12, the next-generation API from Microsoft, with many rumours indicating that asynchronous multi-GPU configs mixing Radeon and Nvidia cards may be possible.



Read more on DX12's Multi-GPU capabilities here.
 
The magic words there are "if developers take advantage of this". There are way too many devs that can't be bothered/don't care. As good as this news is, I am very skeptical this will ever happen.
 
As far as I can remember, Johan Andersson (DICE, the man behind their Frostbite engine) said similar things about the Mantle API at the AMD Developer Summit in 2013.
 
It would be nice if I could have my i5's integrated GPU doing something.
I only just figured out how to boost video encoding times using the dumb thing. I had it enabled in the BIOS and everything, but it never worked at all.
Then I just told my computer to pretend I had a monitor connected to it, and all of a sudden I could encode videos about 7x faster. I found that quite impressive, so I would like to utilize it more. I hope that DX12 does allow me to utilize it in some way, even if all it does is give me bragging rights.
 
I am sure the API will allow cross-vendor setups; the question is, will Nvidia allow that? I am pretty confident AMD wouldn't care, but Nvidia is a lot like Apple.
 
I'm a developer; 3D rendering engines for visualization and gaming are among the targets I've worked on (concentrating on mobile these days).

I can tell you a great deal about these concepts.

First, there is no real obstacle to using GPUs of different origins, as long as the existing drivers can be made to coexist (which hasn't always worked between Nvidia and AMD/ATI). There are several avenues, depending on how one codes to the GPU workload. The net effect isn't much of a gain, but it's been possible for a while, if only as a curious experiment.

Mantle was about the first to hint at these possibilities, reorganizing how RAM in the GPU is treated and unifying access to memory, but OpenGL is incorporating most of these ideas too. DX12 may have been rather quiet about it, or it's late to that game, but either way, the concepts aren't quite the benefit you might expect.

It's been metaphorically described as stacking memory. This isn't an entirely accurate description, but it does capture the contrast with the duplicated scene and asset content that Crossfire and SLI currently require. In the new paradigm, we are simply able to avoid this duplication in each card, not treat the two cards as having a combined storage of twice the RAM (assuming symmetrical systems, not the proposed cross-vendor GPU notion).

At the risk of writing a wall of text, the current paradigm has relied on duplicated assets so that each GPU can render an alternate frame, providing some parallel action across two or more cards. The problem is that there is an inherent frame delay involved, and there is some limit to how much parallel benefit there really is. You'll see 100% GPU utilization on all cards, assuming the CPU is up to the task, but you might not actually get double the overall framerate, depending on the complexity of scenes and the action of physics and other object controllers in the game engine.
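
To make the frame-delay point concrete, here's a toy sketch of AFR scheduling in plain C++. Everything here is my own illustration, not any vendor's driver logic: frame N goes to GPU N % gpuCount, and the frame being presented always trails the frame the CPU is recording.

```cpp
// Toy sketch of alternate-frame rendering (AFR) scheduling, not any vendor's
// actual driver logic: frame N goes to GPU (N % gpuCount), and the frame that
// reaches the display is always a little behind the frame the CPU is preparing.
#include <cstdio>
#include <vector>

struct Frame { int index; int gpu; };

int main() {
    const int gpuCount = 2;               // two cards in SLI/Crossfire-style AFR
    const int framesInFlight = gpuCount;  // each GPU works on its own frame
    std::vector<Frame> inFlight;

    for (int frame = 0; frame < 8; ++frame) {
        int gpu = frame % gpuCount;        // round-robin frame assignment
        inFlight.push_back({frame, gpu});

        // Once the pipeline is full, the oldest frame completes and is shown.
        if ((int)inFlight.size() > framesInFlight) {
            Frame done = inFlight.front();
            inFlight.erase(inFlight.begin());
            std::printf("present frame %d (rendered on GPU %d) while CPU records frame %d\n",
                        done.index, done.gpu, frame);
        }
    }
}
```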

The better approach is to band or tile the output buffer, causing each GPU to render a portion of the display. This is an old technique, applied to multi-core CPUs in high-end rendering such as the output of Maya, 3DS Max, Softimage, and others. Scaling is almost linear through about 16 to 32 cores. The idea would be to put each GPU to work on portions of the display, each with duplicate RAM content (eliminating any benefit of 'stacking' the RAM). From a purely theoretical, mathematical, logical perspective, this is the best means of implementing parallel GPUs.
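
Just to illustrate the division, a rough sketch in plain C++ (names are mine, nothing vendor-specific): the same scene lives on every card, and each GPU only rasterizes its band of rows.

```cpp
// Rough sketch of banded split-frame rendering: the same scene goes to every
// GPU, but each one only rasterizes its slice of the screen. Names are
// illustrative, not from any API.
#include <cstdio>

struct Band { int y0, y1; };   // half-open row range [y0, y1)

Band bandForGpu(int gpu, int gpuCount, int screenHeight) {
    int rowsPerGpu = screenHeight / gpuCount;
    int y0 = gpu * rowsPerGpu;
    int y1 = (gpu == gpuCount - 1) ? screenHeight : y0 + rowsPerGpu;
    return {y0, y1};
}

int main() {
    const int gpuCount = 2, height = 1080;
    for (int gpu = 0; gpu < gpuCount; ++gpu) {
        Band b = bandForGpu(gpu, gpuCount, height);
        // Each GPU holds the full scene in its own RAM and renders rows [y0, y1).
        std::printf("GPU %d renders rows %d..%d\n", gpu, b.y0, b.y1 - 1);
    }
}
```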

There is a problem, though. The GPUs are on opposite sides of a bus. The common output display buffer has to be accumulated from the two cards, but it's a workable problem. I can't tell, yet, if DX12 allows for this paradigm, but it should.

The other option is to divide scene content and assets (the textures and models of the 3D scene) between the multiple GPU cards. This is what is being called "stacking" the RAM. It does NOT present a single, unified RAM space (two 4G cards don't function like one 8G card, despite what the articles say). Instead, we are able to have each card render a portion of the scene content to a common output buffer. Basically, half of the objects (and their textures) are sent to each card in a two-card arrangement, or a quarter of the objects to each of the four cards in a quad. Each GPU processes what 3D content it has, providing output to a z-buffer and display buffer, which is ultimately resolved to a single output display in a final processing phase.

The reason this isn't exactly stacking is that the boundary for division isn't even, and you can't use RAM from one card in another card's rendering. That would cause too much bus traffic to be practicable. GPUs are very heavy RAM performance hogs, and the bus would kill performance. So, the division of assets has to be calculated to the closest reasonable boundary approximating half the storage requirement, but it won't be exactly even. Ideally, two 4G cards would each receive up to 4G of content, but in reality they'll only get about 3.7 to 3.9G at most.

In other words, two 4G cards won't work like a single 8G card; they will act metaphorically like a 7.5G card, which is close.
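
If you want to picture how that boundary might be found, here's a purely hypothetical greedy split by memory footprint; the totals land near the halfway mark but rarely exactly on it, which is the point above.

```cpp
// Hypothetical sketch of splitting scene assets between two cards by memory
// footprint: greedily hand each asset to whichever card currently has more
// room left. The split lands near, but rarely exactly on, the halfway mark.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Asset { const char* name; std::uint64_t bytes; };

int main() {
    std::vector<Asset> assets = {
        {"terrain",   900 << 20}, {"characters", 700 << 20},
        {"buildings", 600 << 20}, {"foliage",    500 << 20},
        {"fx",        300 << 20}, {"ui",          50 << 20},
    };
    // Biggest assets first makes the greedy split tighter.
    std::sort(assets.begin(), assets.end(),
              [](const Asset& a, const Asset& b) { return a.bytes > b.bytes; });

    std::uint64_t used[2] = {0, 0};
    for (const Asset& a : assets) {
        int card = used[0] <= used[1] ? 0 : 1;   // card with more room left
        used[card] += a.bytes;
        std::printf("%-10s -> card %d\n", a.name, card);
    }
    std::printf("card 0: %llu MB, card 1: %llu MB\n",
                (unsigned long long)(used[0] >> 20),
                (unsigned long long)(used[1] >> 20));
}
```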

There's another catch. The division of assets according to an equal amount of storage required does not automatically equate to an equal division of workload on the GPU's. This means that for each frame one GPU might finish well before the other(s), returning a diminished benefit. It would be up to the game engine to determine how assets are divided, and experimentally determine an equitable division of workload as much as an equal division of storage.
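
One plausible way an engine could chase that balance (again hypothetical, just to show the idea) is a feedback loop on measured per-GPU frame times, shifting content away from whichever card keeps finishing late.

```cpp
// Sketch of the balancing problem: equal bytes per card doesn't mean equal GPU
// time. A simple (made-up) feedback loop nudges the asset split toward the
// card that keeps finishing late.
#include <cstdio>

int main() {
    double split = 0.5;   // fraction of scene content on card 0
    // Pretend card 0's half of the scene is heavier to shade than card 1's.
    for (int frame = 0; frame < 6; ++frame) {
        double time0 = 14.0 * (split / 0.5);          // ms, toy cost model
        double time1 =  8.0 * ((1.0 - split) / 0.5);
        std::printf("frame %d: card0 %.1f ms, card1 %.1f ms, split %.2f\n",
                    frame, time0, time1, split);
        // Shift a little content away from whichever card is slower.
        double imbalance = (time0 - time1) / (time0 + time1);
        split -= 0.25 * imbalance * split;
    }
}
```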

In contrast, the tiled or banded output of parallel GPUs storing identical scene content is almost always a very close finish for all GPUs involved, providing much better utilization of GPU resources.

This management burden is not something game engine designers have considered in previous generations, and as one who has done that work in the past, I can tell you it's not exactly attractive, but it has potential gains. You could see more complex scenes, textures and effects, reduced lag, and overall improved frame rates.

Yet, ONLY if one needs more RAM does this curious approach to dividing assets make any sense. The management burden is high, and can be countered by a more conventional approach producing better GPU utilization, with optimization techniques already practiced at utilizing what RAM is available.
 
^^^ How long did that take to write? :p

So essentially what you are saying is that while it is possible, it'll never truly be exact? In other words, one side of the screen (in a half-by-half render split between the GPUs) will be behind the other half because one GPU finished before the other? And overall complex but with great gains (like high risk (complexity), high reward)?
 
:o My brain just exploded from trying to understand all that in one go.... this damn man flu is slowing me right down. :mad:
 
I'm a developer; 3D rendering engines for visualization and gaming are among the targets I've worked on (concentrating on mobile these days).

[...]

I think you've explained this in a rather odd way, you seem to repeat yourself several times?

I agree with what you say about the RAM - they are in different physical places so you may need to duplicate memory. But are we talking about forward rendering only here?
 
Buy an MSI 870A and voila, you've got AMD+Nvidia working together :)
Btw, as far as I understood, with DX12 it would be 4+4=8, unlike the traditional 4+4=4.
Their commercial could be "4+4=4, well not anymore" xD
 
^^^ How long did that take to write? :p

So essentially what you are saying is that while it is possible, it'll never truly be exact? In other words, one side of the screen (in a half-by-half render split between the GPUs) will be behind the other half because one GPU finished before the other? And overall complex but with great gains (like high risk (complexity), high reward)?

Well, I type fast. I had to edit a couple of typos, but it's a stream of consciousness post. :p

Basically you have it. Yet, I may have left confusion about the two approaches.

In the one where we can "stack" memory, the division is not by the display (2D), it's by scene content (think front to back, or walls vs characters, not the display left/right - at least not strictly).

In the version where we divide the display (2D), we can't "stack" RAM. Each card still has the same scene content, but each GPU renders a different portion of the 2D output. This latter version has a much greater likelihood of good synchronization between the GPUs.

I think you've explained this in a rather odd way, you seem to repeat yourself several times?

I agree with what you say about the RAM - they are in different physical places so you may need to duplicate memory. But are we talking about forward rendering only here?

Well, it was a stream of consciousness post, with an "intro/re-examine" layout.

This isn't limited to forward rendering, it can be any mixed model. The choice of forward or not is really down to the GPU resources and the complexity of lighting/materials, and most engines mix whatever is required for the effects involved. Although, your query does recognize there are difficulties.

For others interested, forward rendering is the older style. The 3D pipeline is usually fixed when forward rendering is exclusively used. In deferred rendering, sometimes called deferred shading (and this is getting more complex as the hardware improves), the pipeline is programmable, and the stage where fragments are processed (where each pixel is evaluated) is separated out (deferred) until a later stage, compared to the forward rendering model.

This causes some technical problems with output buffering, hence SPS's inquiry about limiting this to forward rendering. When complex lighting, special effects and transparencies are involved, the content of the output buffer must be read (so it can be attenuated by the color contribution from light/shadow/transparency), then written back. When the two GPUs are processing different 3D geometry, it would seem this is not possible without some way to synchronize an output buffer from two cards, with two distinct memory spaces.

This is yet another reason that tiled or banded output, dividing the 2D space of the display, not the scene content, is really more performance oriented, while "stacking" RAM would be less about performance and more about extending 3D scene content. In reality, I don't think 3D RAM stacking will be a popular paradigm. It may be faster to market, and more equitable, to simply wait for larger RAM cards.

Yet, there are hardware solutions. They are locked behind opaque doors, and we can only glean information about what MIGHT be a particular design approach based on the research on the subject. It appears one of a few solutions is probably in the works at the hardware level, but that means new cards would be required to support the solution.

I don't expect existing cards will be able to fully deploy these new concepts. The style where 2D display output is tiled or banded is more likely supportable in older cards than RAM "stacking".

This is central to SPS's inquiry. Existing cards might be limited to forward-only rendering, which can hinder throughput, if they were to employ RAM "stacking". Yet, to support tiled or banded output where the hardware doesn't solve the problem of resolving tiles into a single display buffer, the method could mean software-driven copying and bus traffic at each frame.

GPUs follow a client/server model. The CPU is the client, the GPU is the server. At each frame, the CPU decides what is on display and drives the 3D pipeline with commands. These commands, and any data which must be supplied or altered to perform them, are sent over the PCIe bus. It's a bottleneck compared to RAM performance within the GPU. Basically, 3D engines regard the GPU as a separate computer. In the original design concepts (back in the '80s and '90s), the GPU could even be in a physically separate cabinet, and the bus might even be a network or other bus connection (like SCSI for display cards).
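
For anyone curious, the CPU-side view of that separation is visible today with plain DXGI adapter enumeration; this enumeration API is real (Windows-only, link dxgi.lib), while everything about how DX12 might share work between those adapters is still the speculative part.

```cpp
// Windows-only sketch of how the CPU side sees each card as a separate device
// with its own memory, using plain DXGI enumeration. Link against dxgi.lib.
#include <dxgi.h>
#include <cwchar>

int main() {
    IDXGIFactory1* factory = nullptr;
    if (FAILED(CreateDXGIFactory1(__uuidof(IDXGIFactory1), (void**)&factory)))
        return 1;

    IDXGIAdapter1* adapter = nullptr;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i) {
        DXGI_ADAPTER_DESC1 desc;
        adapter->GetDesc1(&desc);
        // Each adapter reports its own dedicated VRAM; nothing here is pooled.
        std::wprintf(L"adapter %u: %s, %llu MB dedicated VRAM\n",
                     i, desc.Description,
                     (unsigned long long)(desc.DedicatedVideoMemory >> 20));
        adapter->Release();
    }
    factory->Release();
    return 0;
}
```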

The output buffer is just a block of RAM, and within a single GPU there is fast access to this buffer. When the output buffer is divided into tiles, each GPU can render its designated tiles, leaving empty spaces, like a haphazard chessboard. When all pixels have been rendered for all tiles, the tiles can be accumulated into a completed image for display, but this implies part of the content must be transmitted between cards.
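
As a toy CPU-side illustration of that accumulation step (sizes, names, and the checkerboard assignment are all invented):

```cpp
// Toy compositor for the tiled case: each GPU filled in only its own tiles of
// a same-sized buffer, so the final image is assembled by copying each tile
// from whichever card rendered it. All names are illustrative.
#include <cstdint>
#include <vector>

constexpr int W = 8, H = 8, TILE = 4;        // tiny "screen" and tile size

using Image = std::vector<std::uint32_t>;    // W*H pixels, row-major

// Which GPU owns a given tile: simple checkerboard assignment.
int ownerOfTile(int tx, int ty, int gpuCount) { return (tx + ty) % gpuCount; }

Image assemble(const std::vector<Image>& perGpuOutput, int gpuCount) {
    Image result(W * H, 0);
    for (int ty = 0; ty < H / TILE; ++ty)
        for (int tx = 0; tx < W / TILE; ++tx) {
            int owner = ownerOfTile(tx, ty, gpuCount);
            for (int y = ty * TILE; y < (ty + 1) * TILE; ++y)
                for (int x = tx * TILE; x < (tx + 1) * TILE; ++x)
                    result[y * W + x] = perGpuOutput[owner][y * W + x];
        }
    return result;   // this copy is the cross-card traffic described above
}

int main() {
    std::vector<Image> out(2, Image(W * H));
    for (int p = 0; p < W * H; ++p) { out[0][p] = 0xFF0000FFu; out[1][p] = 0xFF00FF00u; }
    Image frame = assemble(out, 2);
    return frame[0] == 0xFF0000FFu ? 0 : 1;   // tile (0,0) belongs to GPU 0
}
```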

There is an out. I don't know which cards may deploy this solution, but it's in the research (white papers by PhD candidates and scientists). The display circuitry, which provides color pixel by pixel to the display, can be masked, such that through the connectors on the top of the cards (that is, a private bus - not the PCIe bus) the relevant portions of the tiled or banded output can be selected from whichever card processed that tile or band. This is not applicable to the "stacked" RAM approach.

In the tiled or banded approach, each tile represents only a portion of the display, but contains a complete rendering of all 3D scene content. In the "stacked" RAM approach, whatever the output buffer contains (tiled or not... these two concepts could be combined), it does not necessarily contain the complete 3D scene content. Some of the 3D content may have been rendered by the opposing GPU. This leaves a resolution problem.

Let's say each card processed the entire display, divided by 3D scene content so the RAM is "stacked". Card A might have an object on display that appears to be by itself. Card B has an object on display which is actually in front of the object shown in Card A's output. Pixel by pixel, a determination must be made as to which object is in front. That determines the color shown for each pixel involved. To solve this, a z-buffer might be used. That is, for each pixel, the color buffer holds the color of the single object that is closest to the camera, and the Z-buffer holds that object's distance from the camera (its Z coordinate) at that pixel. Compare the two Z coordinates, and you can decide which object is in view and which one is obscured (and this isn't by object, it's by pixel).
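
Written out as a plain CPU sketch, the per-pixel resolve is just a depth comparison between the two cards' outputs (buffers and values here are made up):

```cpp
// Per-pixel depth resolve for the "stacked" case, as a plain CPU sketch: each
// card produced its own colour + Z buffer for the content it owned, and the
// final colour at a pixel comes from whichever card saw the nearer surface.
#include <cstdint>
#include <vector>

struct GpuOutput {
    std::vector<std::uint32_t> color;   // one value per pixel
    std::vector<float>         depth;   // distance from camera, smaller = nearer
};

std::vector<std::uint32_t> resolveDepth(const GpuOutput& a, const GpuOutput& b) {
    std::vector<std::uint32_t> out(a.color.size());
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = (a.depth[i] <= b.depth[i]) ? a.color[i] : b.color[i];
    return out;
}

int main() {
    // Two pixels: card A's surface is nearer at pixel 0, card B's at pixel 1.
    GpuOutput a{{0xAAu, 0xABu}, {1.0f, 9.0f}};
    GpuOutput b{{0xBAu, 0xBBu}, {5.0f, 2.0f}};
    auto frame = resolveDepth(a, b);
    return (frame[0] == 0xAAu && frame[1] == 0xBBu) ? 0 : 1;
}
```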

This doesn't account for transparency. Transparency is processed as a second pass. Some GPU designs require geometry to be sorted from back to front, while others don't have a preference. Yet, all of them require that opaque objects are processed first, so that transparencies (and other special effects) can be calculated by incorporating the color of the object(s) behind the transparency (or effect). This becomes highly complicated if the GPUs are processing different 3D scene content.
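
For reference, the transparency pass itself is simple when you do have the colour behind the surface; here's a minimal toy of back-to-front "over" blending at one pixel (my own code, not any engine's), which shows exactly what breaks when that colour lives on the other card:

```cpp
// Minimal sketch of a transparency pass: opaque geometry first, then
// transparent surfaces sorted back-to-front and blended over whatever is
// already in the output buffer ("over" blending).
#include <algorithm>
#include <vector>

struct Rgba { float r, g, b, a; };

// Blend src over dst using src.a as coverage.
Rgba over(const Rgba& src, const Rgba& dst) {
    return { src.r * src.a + dst.r * (1.0f - src.a),
             src.g * src.a + dst.g * (1.0f - src.a),
             src.b * src.a + dst.b * (1.0f - src.a),
             1.0f };
}

struct TransparentSurface { float depth; Rgba color; };

Rgba shadePixel(Rgba opaque, std::vector<TransparentSurface> surfaces) {
    // Farthest first, so nearer glass is composited last, on top.
    std::sort(surfaces.begin(), surfaces.end(),
              [](const TransparentSurface& x, const TransparentSurface& y) {
                  return x.depth > y.depth;
              });
    Rgba result = opaque;
    for (const TransparentSurface& s : surfaces)
        result = over(s.color, result);   // needs the colour behind it: the catch
    return result;
}

int main() {
    Rgba wall{0.2f, 0.2f, 0.2f, 1.0f};
    std::vector<TransparentSurface> glass = {
        {2.0f, {1.0f, 0.0f, 0.0f, 0.5f}},   // near red glass
        {5.0f, {0.0f, 0.0f, 1.0f, 0.5f}},   // far blue glass
    };
    Rgba px = shadePixel(wall, glass);
    return px.r > px.b ? 0 : 1;   // red glass is in front, so red dominates
}
```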

As SPS's inquiry implies, a simple (read: oversimplified) solution would be to require forward rendering. It doesn't resolve all the issues, but it takes a good step forward.

However, the real problem is merely which color to use for a transparency or effect calculation. If Card A, from above, were processing a transparency, but didn't have the color from Card B for an object that's really in front of the forward-most object Card A "knew" about, the transparency calculation would be incorrect.

This implies that the output buffer must be resolved to a single output before a transparency phase can be executed. There are numerous research articles on the topic, but I can guess which will be employed.

Instead of thinking of the output buffer as a traditional 2D image to be put on display, an intermediate output buffer of higher complexity is proposed, with multiple "panes" or "layers"... like Photoshop layers. Transparency or special effect processing requires a geometry calculation (a vertex phase in the pipeline), then a fragment calculation (the pixel-by-pixel processing) to contribute to an output color. If this is on a separate layer, with all the required information, a new type of deferred processing is implied: one where the output of each GPU, which is of partial 3D scene content, is shipped to the adjacent card(s) for resolution by a fast process of limited demand. These layers are "flattened" by a simplified post-processing phase, resolving all issues with respect to depth and transparency.
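
As a guess at the shape of such a scheme, and only a guess, the flattening step might look something like this on the CPU: each layer carries colour, depth, and alpha, and a cheap post pass sorts and blends them.

```cpp
// Hypothetical "layers" resolve: each GPU hands over a per-pixel layer
// (colour, depth, alpha) for the content it rendered, and a cheap post pass
// flattens all layers in depth order. A guess at the shape of such a scheme,
// not anything documented.
#include <algorithm>
#include <vector>

struct LayerSample { float depth; float r, g, b, a; };   // one layer's value at a pixel

// Flatten all layers covering one pixel: far to near, "over" blending.
void flattenPixel(std::vector<LayerSample> layers, float out[3]) {
    std::sort(layers.begin(), layers.end(),
              [](const LayerSample& x, const LayerSample& y) { return x.depth > y.depth; });
    out[0] = out[1] = out[2] = 0.0f;
    for (const LayerSample& l : layers) {
        out[0] = l.r * l.a + out[0] * (1.0f - l.a);
        out[1] = l.g * l.a + out[1] * (1.0f - l.a);
        out[2] = l.b * l.a + out[2] * (1.0f - l.a);
    }
}

int main() {
    // Card A contributed an opaque far surface, card B a nearer transparency.
    std::vector<LayerSample> layers = {
        {8.0f, 0.1f, 0.6f, 0.1f, 1.0f},   // from card A
        {3.0f, 0.9f, 0.1f, 0.1f, 0.4f},   // from card B
    };
    float rgb[3];
    flattenPixel(layers, rgb);
    return rgb[0] > 0.0f ? 0 : 1;
}
```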

There is another approach, though. This can be transported via the external bus (that connector on the top of the cards), where output calculation is programmed via a new set of instructions. Basically, as each pixel is being sent to the display, it's evaluated, in real time, with respect to the depth/transparency calculations. This is quite possible, but requires hardware support. The traditional output of a finished 2D image is traded for a more complex buffer of instructions per pixel, instead of visible pixels. As each pixel is considered for output, these instructions act like well-encoded machine language instructions which give the output processor the ability to resolve the correct color, even though the data from each card is basically incomplete.
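
Again purely speculative, but the idea boils down to something like a tiny per-pixel interpreter at scan-out (ops and names invented by me):

```cpp
// Purely speculative sketch of the "instructions per pixel" idea: instead of a
// finished image, each pixel carries a tiny op telling the output stage how to
// build its colour from the cards' partial results at scan-out time.
#include <cstdint>
#include <vector>

enum class PixelOp : std::uint8_t {
    TakeA,        // use card A's colour as-is
    TakeB,        // use card B's colour as-is
    BlendAOverB   // card A holds a transparency over card B's surface
};

struct PixelProgram { PixelOp op; float alpha; };

std::uint8_t resolve(PixelProgram p, std::uint8_t a, std::uint8_t b) {
    switch (p.op) {
        case PixelOp::TakeA:       return a;
        case PixelOp::TakeB:       return b;
        case PixelOp::BlendAOverB: return static_cast<std::uint8_t>(a * p.alpha + b * (1.0f - p.alpha));
    }
    return 0;
}

int main() {
    // One channel of three pixels, "resolved" as the display controller scans out.
    std::uint8_t cardA[3] = {200, 10, 128};
    std::uint8_t cardB[3] = {10, 200, 64};
    PixelProgram prog[3] = {{PixelOp::TakeA, 0.0f},
                            {PixelOp::TakeB, 0.0f},
                            {PixelOp::BlendAOverB, 0.5f}};
    std::vector<std::uint8_t> out;
    for (int i = 0; i < 3; ++i) out.push_back(resolve(prog[i], cardA[i], cardB[i]));
    return out[2] == 96 ? 0 : 1;   // 128*0.5 + 64*0.5
}
```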

There are other approaches, hybrids...and of course I can't tell which one(s) each company may prefer until product is in hand (and I've already been called out for going on too much :D).
 
For others interested, forward rendering is the older style. The 3D pipeline is usually fixed when forward rendering is exclusively used. In deferred rendering, sometimes called deferred shading (and this is getting more complex as the hardware improves), the pipeline is programmable, and the stage where fragments are processed (where each pixel is evaluated) is separated out (deferred) until a later stage, compared to the forward rendering model.

I think you've gotten a little confused between the fixed function (pre-dx9) and the programmable pipeline?

For others reading, basically in forward rendering you would evaluate lighting per light per vertex. In deferred rendering you render all your model data into screen-space buffers (G-Buffers) and compute the lighting per light per screen pixel, which is way faster. You can then combine this with tiled lighting to compute lighting per tile in parallel in a compute shader (usually 16x16 tiles).
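
A CPU-side toy of that tiled idea, just to show the cull-then-shade structure (the real thing lives in a compute shader, and all the numbers here are made up):

```cpp
// CPU-side sketch of tiled deferred lighting: cull lights per tile, then shade
// each pixel in that tile against only the lights that survived the cull.
#include <cmath>
#include <vector>

struct Light { float x, y, radius, intensity; };   // screen-space toy light
struct GBufferTexel { float x, y, albedo; };        // stand-in for real G-buffer data

constexpr int W = 64, H = 64, TILE = 16;

int main() {
    std::vector<Light> lights = {{10, 10, 20, 1.0f}, {50, 50, 12, 0.7f}};
    std::vector<GBufferTexel> gbuffer(W * H);
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            gbuffer[y * W + x] = {float(x), float(y), 0.8f};

    std::vector<float> lit(W * H, 0.0f);
    for (int ty = 0; ty < H / TILE; ++ty)
        for (int tx = 0; tx < W / TILE; ++tx) {
            // Light culling: keep only lights whose radius reaches this tile.
            float cx = tx * TILE + TILE * 0.5f, cy = ty * TILE + TILE * 0.5f;
            std::vector<Light> tileLights;
            for (const Light& l : lights) {
                float dx = l.x - cx, dy = l.y - cy;
                if (std::sqrt(dx * dx + dy * dy) < l.radius + TILE)   // coarse test
                    tileLights.push_back(l);
            }
            // Shade every pixel in the tile against the culled light list only.
            for (int y = ty * TILE; y < (ty + 1) * TILE; ++y)
                for (int x = tx * TILE; x < (tx + 1) * TILE; ++x) {
                    const GBufferTexel& g = gbuffer[y * W + x];
                    for (const Light& l : tileLights) {
                        float dx = l.x - g.x, dy = l.y - g.y;
                        float d = std::sqrt(dx * dx + dy * dy);
                        if (d < l.radius)
                            lit[y * W + x] += g.albedo * l.intensity * (1.0f - d / l.radius);
                    }
                }
        }
    return lit[10 * W + 10] > 0.0f ? 0 : 1;   // the pixel under light 0 got lit
}
```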

Transparency has issues with deferred rendering as you can't occlude objects that fail the depth test, and so most engines (UE4 for example) use forward rendering for these.
 
I have been known to type out lots.
Actually, trying to read it when someone else does it shows me that, when greeted with a wall of text, some people may just not bother reading even the first word, lol.
 
Cannot be bothered with walls of text.

@Jasonlylevene please keep it short and simple or people won't read what you post; it can be done. :)

But as to the OP, it is not going to happen, and if anyone does manage it, it will be so awful that no one is going to want to use it.
 