PCI-E Bandwidth Discussion

SPS

Moderator
These posts have been moved from the AMD teasing thread to keep that thread on topic.

But the bottleneck in batches/draw calls is down to CPU load, not the bus.

So Mantle etc. are about reducing CPU load proper, or doing more effective multi-threading.


Bus bandwidth bottlenecks on PCI-E, I'm less sure about.

Does anyone actually know the frame-by-frame PCI-E data traffic loads, what they consist of, and why more bandwidth would alleviate a currently measurable bottleneck?

I'm not sure I understand it right, but is local VRAM used for storing textures and geometry as well as frame buffers etc., or is texture/geometry data pushed to the GPU from system RAM every frame?
From tests I gathered it was the former, in which case I'm not sure why current bus bandwidth would be a huge issue.
If it's the latter then we'd probably be bus-bandwidth bound constantly, which isn't the case!

Hence my confusion on that point hehe :)

All GPU data is stored on the GPU. The only things that usually get pushed to the GPU every frame are shader parameters (constant buffers in DX11).

For example, a vertex buffer for a mesh is generally loaded by the CPU and then pushed to the GPU never to be touched again by the CPU. To move a mesh around, a transformation matrix is sent to the GPU (very little data compared to the entire vertex buffer). The vertex shader is then responsible for transforming each vertex by the matrix.
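Roughly what that split looks like in DX11 (a minimal sketch; the struct layout and names are just illustrative):

```cpp
#include <d3d11.h>

// Per-object constants sent to the GPU every frame. The layout here is
// illustrative and must match the cbuffer declared in the vertex shader.
struct PerObjectConstants
{
    float worldMatrix[16]; // 4x4 transform, 64 bytes
};

// Done once at load time: the whole vertex buffer crosses the bus here,
// then lives in VRAM and is never touched again by the CPU.
ID3D11Buffer* CreateStaticVertexBuffer(ID3D11Device* device,
                                       const void* vertices, UINT byteSize)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = byteSize;
    desc.Usage = D3D11_USAGE_IMMUTABLE;        // CPU never writes it again
    desc.BindFlags = D3D11_BIND_VERTEX_BUFFER;

    D3D11_SUBRESOURCE_DATA init = {};
    init.pSysMem = vertices;

    ID3D11Buffer* vb = nullptr;
    device->CreateBuffer(&desc, &init, &vb);
    return vb;
}

// Done every frame: only the 64-byte matrix goes over the bus, not the mesh.
// constantBuffer is assumed to be a D3D11_USAGE_DEFAULT buffer of matching size.
void PushPerObjectConstants(ID3D11DeviceContext* ctx,
                            ID3D11Buffer* constantBuffer,
                            const PerObjectConstants& data)
{
    ctx->UpdateSubresource(constantBuffer, 0, nullptr, &data, 0, 0);
    ctx->VSSetConstantBuffers(0, 1, &constantBuffer);
}
```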

Do you actually have some data? I'm genuinely interested in the data flow and what it's made up of on a frame-to-frame and second-to-second basis... to 'fill' gigabytes per second of bandwidth.

Cheers

Dave

I do have some data yes, I'd have to dig out the specifics when I'm back from work, but I can tell you it slowed my frame down by at least 1000x. This was done in a single-threaded DX11 app and the test was reading back from the GPU. APIs tend to be a bit more awkward when it comes to mapping GPU data back to the CPU, so I couldn't tell you how much the API itself affected these timings.
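For a rough idea of what that kind of readback looks like in DX11 (a minimal sketch, not the actual test code), the Map on the staging copy is where the CPU sits waiting for the GPU:

```cpp
#include <d3d11.h>

// Copy a GPU texture into a CPU-readable staging texture and map it.
// The Map call blocks until the GPU has finished producing the data,
// which is where the big frame-time hit shows up.
bool ReadBackTexture(ID3D11Device* device, ID3D11DeviceContext* ctx,
                     ID3D11Texture2D* gpuTexture)
{
    D3D11_TEXTURE2D_DESC desc = {};
    gpuTexture->GetDesc(&desc);
    desc.Usage = D3D11_USAGE_STAGING;
    desc.BindFlags = 0;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
    desc.MiscFlags = 0;

    ID3D11Texture2D* staging = nullptr;
    if (FAILED(device->CreateTexture2D(&desc, nullptr, &staging)))
        return false;

    ctx->CopyResource(staging, gpuTexture);   // GPU-side copy into staging memory

    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(ctx->Map(staging, 0, D3D11_MAP_READ, 0, &mapped)))
    {
        // mapped.pData now points at the texel data; read it on the CPU here.
        ctx->Unmap(staging, 0);
    }

    staging->Release();
    return true;
}
```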
 

Sounds good, I'm excited to see your data!

I did some rough tests here on my old GTX275/p867 mobo/2500k.

You'll have to excuse the less than scientific approach but it was just checking the logic of the entire idea of GPU vs system RAM storage.
http://www.racedepartment.com/threads/racers-graphics-engine-graphics-generally-memory.54984/

But in the end it seemed that even huge amounts of texture data could be shipped across the PCI-E bus within a single frame of rendering, at 60fps+.

Since my old GTX275 had just a third of the VRAM required to store those textures, ignoring HDR cube maps, frame buffers, shadow maps etc., the system RAM > VRAM bus bandwidth was clearly pretty damn high to render out 3 GB worth of scene textures at 60fps.


Now obviously this isn't ideal, and it's clear that at frame render time, if data can't be stored locally it comes over from system RAM.
This comes at a cost vs a local VRAM lookup, but it clearly wasn't totally terrible having to resort to system RAM over the bus.

Indeed, we only need to look back to the early AGP days to see that giving the GPU fast access to texture data held in system RAM has been a priority for many years.



There is obviously LOADS of bandwidth even on those old mobos to be shunting data around.


I'm still more of the view that the real costs to fast performance are the CPU overheads proper, rather than anything in the bus itself.


I'll re-jig that test track for something crazy like 16GB of textures and see what happens with it all in view :D

Dave
 
So your tests were done by creating a racetrack map that contained lots of texture data? As we don't have access to the code we have no idea what the engine is doing, so I'm a bit dubious about relying on this as a plausible form of testing. I might try to create a very small test app purely to test this and see what numbers we get.
 
Well, in this case the test was done to find the impact of texture distribution and batch optimisation for that game engine.
I.e., would batch X on geometry Y that was LOD'd out cost GPU VRAM if the texture used was in excess of the GPU VRAM size, etc.



But the point was that texture data in the test was only coming from one place at 60Hz, and that was over the PCI-E bus.


If there are other overheads then that only pushes the expected bandwidth guesstimate from this kind of test higher, rather than lower.

I suppose some kind of compression may push it lower, but I'm not sure that is going on. Compression in system RAM to reduce bus data loads?!


Cheers

Dave
 
It would only load the textures once, not every frame though. Unless I'm missing something here?
 
If you only have, say, 512 MB of VRAM after overheads and your scene has 1.5 GB of textures, it must be sending ~1 GB per frame from system RAM to get all that texture data rendered in the scene.

It needs to get those textures into VRAM at some stage to render them on every frame.

Unless I've really missed something somewhere hehe... but I'm not sure I have at this point... This was the exact reason we got AGP in the late 90s: GPU VRAM was tiny for textures and PCI at the time was super slow for high-fps texture transfers.


Hmmmm...

So in theory, I'd say a PCI-E bottleneck would show up when FPS really starts to drop in response to huge texture volumes needing to be sent via PCI-E.

Say 3 GB might be unnoticeable but 4 GB might start slowing us down, as each frame waits on textures coming over from RAM for increasingly long durations. I.e., a bottleneck.


All just assumption from observation but it sounds about right.

And it's why I don't think we have any real FPS bottlenecks right now because of PCI-E.


Cheers

Dave
 

Yes, this is known as texture streaming, but it's much slower than loading everything into VRAM up front and can cause stalls. Sometimes it cannot be avoided on your target platform, so you have to accommodate it as part of your frame budget. Also, the whole texture may not be loaded when required, as it depends on the mip level needed.
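For a rough idea, uploading just the mips you need in DX11 looks something like this (a sketch only; real streaming systems are much more involved):

```cpp
#include <d3d11.h>

// Upload a single mip level of a texture from system memory into an
// existing GPU texture (created with D3D11_USAGE_DEFAULT). A streaming
// system would call this only for the mips the current view actually
// needs, rather than uploading everything.
void UploadMipLevel(ID3D11DeviceContext* ctx,
                    ID3D11Texture2D* gpuTexture,
                    UINT mipLevel, UINT totalMipLevels,
                    const void* mipData, UINT rowPitchBytes)
{
    // Subresource index for (mip, array slice 0) — helper from d3d11.h.
    const UINT subresource = D3D11CalcSubresource(mipLevel, 0, totalMipLevels);

    // This is where the data actually crosses the PCI-E bus.
    ctx->UpdateSubresource(gpuTexture, subresource, nullptr,
                           mipData, rowPitchBytes, 0);
}
```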

However, there are lots of other interesting things we could do with the GPU if the bandwidth was not so bad.
 
Can a mip really be requested over the PCI-E bus by the graphics card from system RAM?

Surely it'll load the full texture and grab the mip it requires? Unless system RAM allows storage and retrieval of mip maps like that, but I'd be surprised, as some formats generate mips on the card at load time rather than store them.



I'm not sure it's important for the layman to know, but it's very important when discussing bottlenecks and comparing upgrade paths.

Right now it seems the fastest PCI-E bus speeds (3.0 x16) allow roughly 16 GB/s of data to move across the bus in each direction, so ~32 GB/s in total if you count both directions, I'm assuming.

So what is using up all that bandwidth and causing it to be a bottleneck for anyone?

If we're not streaming in external textures, which are the biggest consumer of VRAM I know of, then what else are we streaming in such volumes?



In my brief testing, streaming excessively sized textures over PCI-E did appear to have an effect on FPS, but even at the lowest FPS the bandwidth required would still need to be very high to transmit that data at that rate.

I.e., 60fps in a very poor-case scenario isn't a particularly bad bottleneck when you've brute-forced a completely unrealistic workload onto the PCI-E bus.




I'm happy to learn more about what data is flying over the bus and how it could possibly be causing bottlenecks that need immediate alleviating on current hardware.





But for now I'm still firmly of the mind that the biggest single pain in the ass is batch CPU overhead, not the data transfer speed between the GPU and the CPU/RAM.
This is why we're seeing AMD develop Mantle.

Maybe people merely describe that bottleneck as a bus/PCI-E > CPU bottleneck, but it's wrong to describe it like that.
The batch bottleneck is ON the CPU itself, because of inefficient multi-threading and an inefficient API.


There isn't much consumers can do about this right now really, other than wait for better APIs to arrive.

Alluding to the idea that their PCI-E isn't fast enough, and that this is why their FPS is low, just means they'll run out to buy a faster mobo/CPU, and if they see higher FPS it'll be because the CPU is faster, not their PCI-E...

In my view at least :)

Maybe I'm wrong, but I've yet to see a convincing argument that PCI-E is a bottleneck for the 99% or so of consumers who are not running triple 4K screens and quad Titans :D

Thanks

Dave
 
No, you'd have a bespoke lodding system on the CPU that would be responsible for uploading the correct textures; the GPU can't request them.
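Very roughly, the CPU side of that looks something like the sketch below; the heuristic, struct and function names are made up for illustration, not from any particular engine:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// State per streamed texture (hypothetical; whatever the engine tracks).
struct StreamedTexture
{
    uint32_t totalMips = 1;
    uint32_t residentTopMip = 1000; // smallest (most detailed) mip index currently in VRAM
};

// Hypothetical hook into the engine's streaming path: reads the mip from
// disk/system RAM and pushes it to the GPU (e.g. via UpdateSubresource).
void RequestMipUpload(StreamedTexture& tex, uint32_t mip)
{
    // Placeholder: a real engine would queue a disk read + GPU upload here.
    tex.residentTopMip = mip;
}

// Pick the mip level each object needs from its distance to the camera,
// then request any missing, more detailed mips. The heuristic (one extra
// mip per doubling of distance) is purely illustrative.
void UpdateTextureLods(std::vector<StreamedTexture>& textures,
                       const std::vector<float>& objectDistances,
                       float fullDetailDistance)
{
    for (size_t i = 0; i < textures.size(); ++i)
    {
        float d = objectDistances[i];
        uint32_t wantedMip = 0;
        if (d > fullDetailDistance)
            wantedMip = 1 + static_cast<uint32_t>(std::log2(d / fullDetailDistance));
        if (wantedMip >= textures[i].totalMips)
            wantedMip = textures[i].totalMips - 1;

        // Only stream if we don't already have detailed-enough mips resident.
        if (wantedMip < textures[i].residentTopMip)
            RequestMipUpload(textures[i], wantedMip);
    }
}
```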

Also geometry streaming is another thing that sometimes needs to be done.

Anyway, the point is that you don't see the bottlenecks much because developers avoid them or accommodate for them.


I agree that the CPU cannot send draw calls quickly enough due to the driver thread.
 
What's a bespoke lodding system?

Isn't this stuff built into the API and so transparent?

I.e., when the draw call is made the CPU just says that the GPU needs to load texture X from system memory to complete the job, because it's not in GPU memory?


I'm sure some systems have optimisations etc, but isn't the whole point of the API to control the majority of all these kinds of issues so they are transparent to the graphics programmer?

I know if you were writing some engine in assembly you'd be saying exactly where every bit of data goes and how, but in OGL/DX you surely have a really robust starting architecture to cover all these issues?




I'm still struggling with the concept of it all though.

If textures are not generally expected to go over PCI-E bus, and are instead meant to fit into GPU memory, then what is filling the PCI-E bus to make it a bottleneck currently?

And if textures are expected to go over the PCI-E bus because VRAM is used up, then from quick tests it doesn't seem like the cost is particularly high even when spamming the PCI-E bus with stupidly high volumes of texture data.



I'm still struggling to accept that the PCI-E bus is a bottleneck for the 99% of gamers out there running a balanced computer system.

If we are going to say it's a bottleneck, we need proof that it is the case, so we can then tell people the best ways to alleviate/minimise it in their build.

Thanks

Dave
 

No, this is all manual.

E.g., to get a texture onto the pipeline:

First you must create a texture object from loaded texture data using
ID3D11Device::CreateTexture2D.

Then you must create a shader resource view to sample the texture in the shader using ID3D11Device::CreateShaderResourceView.

Then, after you have set the shader on the pipeline, you must bind the shader resource view to a slot using ID3D11DeviceContext::PSSetShaderResources (this sets it for the pixel shader).

That makes the texture visible to the shader, but to sample it you must first create a sampler state using ID3D11Device::CreateSamplerState.

Then bind this sampler state to the pipeline using ID3D11DeviceContext::PSSetSamplers.

Now you can call draw on your primitives and use textures in your shader.
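Put together, a bare-bones version of those steps looks something like this (a sketch; error handling and most description fields trimmed):

```cpp
#include <d3d11.h>

// Minimal path from raw texel data to a texture usable in a pixel shader.
bool BindTextureForPixelShader(ID3D11Device* device, ID3D11DeviceContext* ctx,
                               const void* texels, UINT width, UINT height,
                               UINT rowPitchBytes)
{
    // 1. Create the texture object from the loaded texel data.
    D3D11_TEXTURE2D_DESC texDesc = {};
    texDesc.Width = width;
    texDesc.Height = height;
    texDesc.MipLevels = 1;
    texDesc.ArraySize = 1;
    texDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    texDesc.SampleDesc.Count = 1;
    texDesc.Usage = D3D11_USAGE_IMMUTABLE;
    texDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE;

    D3D11_SUBRESOURCE_DATA init = {};
    init.pSysMem = texels;
    init.SysMemPitch = rowPitchBytes;

    ID3D11Texture2D* texture = nullptr;
    if (FAILED(device->CreateTexture2D(&texDesc, &init, &texture)))
        return false;

    // 2. Create a shader resource view so the shader can sample it.
    ID3D11ShaderResourceView* srv = nullptr;
    if (FAILED(device->CreateShaderResourceView(texture, nullptr, &srv)))
        return false;

    // 3. Bind the view to a pixel shader slot.
    ctx->PSSetShaderResources(0, 1, &srv);

    // 4. Create and bind a sampler state so the shader can sample the view.
    D3D11_SAMPLER_DESC sampDesc = {};
    sampDesc.Filter = D3D11_FILTER_MIN_MAG_MIP_LINEAR;
    sampDesc.AddressU = D3D11_TEXTURE_ADDRESS_WRAP;
    sampDesc.AddressV = D3D11_TEXTURE_ADDRESS_WRAP;
    sampDesc.AddressW = D3D11_TEXTURE_ADDRESS_WRAP;
    sampDesc.ComparisonFunc = D3D11_COMPARISON_NEVER;
    sampDesc.MaxLOD = D3D11_FLOAT32_MAX;

    ID3D11SamplerState* sampler = nullptr;
    if (FAILED(device->CreateSamplerState(&sampDesc, &sampler)))
        return false;
    ctx->PSSetSamplers(0, 1, &sampler);

    // 5. Subsequent Draw calls can now sample the texture in the pixel shader.
    //    (Objects are kept alive for the app's lifetime in this sketch.)
    return true;
}
```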


Like I said, the bottleneck is actively avoided in commercial products. To look at some data, just search on Google Scholar for research papers.
 
This was an interesting little video showing the PCI-E bandwidth not really being a huge bottleneck.

https://www.youtube.com/watch?v=rctaLgK5stA

4K gaming with the PCI-E bandwidth halved from x16 to x8 didn't really cost any FPS at all.

Again, yes, the bottlenecks have been avoided in good commercial products, but just by sensible design and consideration.
I'm not sure there are hidden API calls or custom coding going on there that other people don't know about or can't achieve.
Everything done is within the bounds of the API. That is exactly why AMD have developed Mantle: because D3D and OGL have some limitations that developers can't creatively program/optimise their way out of.




As for the D3D setup for the shader, those are all API calls though. Where automatic becomes manual is the question, I suppose.

From the perspective of programming with a graphics API like OGL or D3D, that is massively automatic vs writing a graphics engine for a Game Boy Advance, which is literally ground-up bespoke from game to game (well, the good ones anyway).



I'm still genuinely interested in how much customisation for optimisation is possible via the OGL or D3D APIs that isn't just assumed best practice, and so seen in all commercial implementations anyway.



I'll try Google for some specific data, but I'm still struggling to see the smoking gun of PCI-E being a bottleneck in any way, shape or form in current implementations.

Dave
 
Okay, so this is a bit of a noob question then...

I have a wireless card that fits in what I'm assuming is a PCI-E x1 slot, pictured here:
http://www.newegg.com/Product/Product.aspx?Item=N82E16833166076

An Avermedia recording device that also fits in a PCI-E x1 slot, pictured here:
http://www.newegg.com/Product/Produ...00&cm_re=live_gamer_hd-_-15-100-100-_-Product

And a GTX 700 series Graphics card which is a PCI-E x16.

So because my processor, a 4790K, has only 16 PCI Express lanes... the graphics card is actually operating at PCI-E x8 speeds?

Am I understanding that right?
 
OK, maybe the word bottleneck is bogging you down.
The point I was trying to get across is that PCI-E transfer is SLOW and, if not actively avoided, it would become the bottleneck of the system.
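To put rough numbers on "slow" (ballpark figures, not exact): the local VRAM on a current high-end card has bandwidth in the region of a few hundred GB/s, while a PCI-E 3.0 x16 link moves roughly 16 GB/s in each direction, so anything that has to be fetched over the bus every frame is an order of magnitude further away than data already sitting in VRAM.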
 