Actually, there are quite a few factors behind NVIDIA's pre-Turing asynchronous compute performance; the cache issue you're talking about specifically concerns the integer and floating-point pipelines. Second-generation Maxwell had basic async compute support in that it could run a graphics queue alongside compute queues, but it lacked dynamic resource allocation, which was a major hit when trying to use DX12's asynchronous compute capabilities. Pascal fixed this and added dynamic load balancing between all the queues, but the scheduler was still pretty weak: it couldn't handle individual thread scheduling and used a per-warp scheduling system rather than per-thread, it didn't allow concurrent execution of the integer and floating-point pipelines, and, as you mentioned, the results of memory address calculations had to be stored in a different cache from the one used for addressing, so the required transfer incurred a further penalty. Volta's scheduler and execution model fixed many of these problems. AMD, by contrast, already had multiple dedicated Asynchronous Compute Engine blocks on their cards from GCN's start, which were essentially graphics command controllers for compute pipelines, and the hardware already did per-thread scheduling, almost like a CPU in a way, so most of these issues never really existed on their side; they only added concurrent FP+INT execution with Vega or the PS4 Pro.
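To make the per-warp vs per-thread distinction concrete, here's a toy Python model of instruction issue on a divergent branch. The function names and instruction counts are mine, purely for illustration; this is not how any real GPU scheduler is implemented.

```python
# Toy model: count instruction slots issued for a branch where some
# threads take the 'then' side and some take the 'else' side.

def issue_count_per_warp(predicates, then_len, else_len):
    """A warp executing in lockstep must issue BOTH sides of a
    divergent branch (inactive lanes masked off) whenever its
    threads disagree on the predicate."""
    if all(predicates):
        return then_len
    if not any(predicates):
        return else_len
    return then_len + else_len  # divergence: both paths serialised

def issue_count_per_thread(predicates, then_len, else_len):
    """An idealised per-thread scheduler lets each thread issue
    only the path it actually takes."""
    return max(then_len if p else else_len for p in predicates)

warp = [i % 2 == 0 for i in range(32)]  # half the lanes take each side
print(issue_count_per_warp(warp, 10, 10))    # 20: both paths issued
print(issue_count_per_thread(warp, 10, 10))  # 10: paths overlap
```

With a uniform warp (all lanes agreeing) both models cost the same; the per-warp penalty only appears under divergence, which is exactly where finer-grained scheduling helps.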
Sometimes concurrent FP+INT calculation is referred to as asynchronous compute, but this is a very different thing from the DX12 concept of asynchronous compute. When used in games, this kind of concurrent/packed math is branded by AMD as Rapid Packed Math, though hardware FP+INT concurrency does impact the performance of some Rapid Packed Math implementations.
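For what packed math actually buys you: the idea behind Rapid Packed Math is that one 32-bit register holds two FP16 values and a single instruction operates on both halves at once, doubling FP16 throughput. Here's a conceptual sketch using Python's half-precision `struct` format; the function names are mine and this is obviously not real GPU ISA.

```python
import struct

def pack2(a, b):
    """Pack two floats into 4 bytes as a pair of IEEE half-precision
    values -- one '32-bit register' holding two FP16 lanes."""
    return struct.pack('<2e', a, b)

def padd(reg_x, reg_y):
    """'One instruction': add both FP16 halves of two packed registers."""
    x0, x1 = struct.unpack('<2e', reg_x)
    y0, y1 = struct.unpack('<2e', reg_y)
    return pack2(x0 + y0, x1 + y1)

r = padd(pack2(1.5, 2.5), pack2(0.5, 0.5))
print(struct.unpack('<2e', r))  # (2.0, 3.0)
```

Two adds for the cost of one issued instruction is where the "rapid" comes from, at the price of FP16's reduced precision and range.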
Games using DX12 should have separate pipelines and code paths for different architectures to properly make use of the API; in fact, some in the industry argue that if you don't have the resources for that, a developer should just stick with DX11. If a game runs badly on an architecture, it's possible the developers just didn't bother to implement a well-optimised path for it, or it's going through a fallback layer, or there's only one code path, a modified version of an existing console one, which is best suited to similar architectures.
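In practice, per-architecture path selection often keys off the adapter's vendor ID. The IDs below are the real PCI vendor IDs (what e.g. `DXGI_ADAPTER_DESC.VendorId` reports), but the path names and dispatch scheme are hypothetical, just to sketch the shape of it:

```python
# Hypothetical per-architecture code-path selection keyed on the
# PCI vendor ID reported by the graphics adapter.
VENDOR_PATHS = {
    0x10DE: "nvidia_path",  # NVIDIA
    0x1002: "amd_path",     # AMD
    0x8086: "intel_path",   # Intel
}

def select_render_path(vendor_id):
    # Anything we haven't tuned for gets a generic, conservatively
    # written fallback rather than someone else's optimised path.
    return VENDOR_PATHS.get(vendor_id, "generic_fallback_path")

print(select_render_path(0x10DE))  # nvidia_path
print(select_render_path(0x1B36))  # generic_fallback_path (unknown ID)
```

Real engines key on more than the vendor (device generation, feature caps queried from the API), but the principle is the same: never silently run one architecture's tuned path on another.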
There's no benefit to removing abstraction if you're not going to make use of the extra ability to fine-tune to the hardware, basically, and if you remove that abstraction and then try to run code created for a different architecture, it's obviously going to run into issues. The key technological benefit of abstraction is hardware agnosticism, which is one reason (besides the abstraction itself) why DX11 is so much easier to work with: a single piece of code should, in theory, work mostly the same way across different architectures.
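The hardware-agnosticism point can be sketched in a few lines: game code written once against an interface runs unchanged on whatever backend implements it. All the names here are hypothetical, purely to illustrate the idea.

```python
from abc import ABC, abstractmethod

class Device(ABC):
    """The abstraction layer: game code only ever sees this."""
    @abstractmethod
    def dispatch(self, work): ...

class ArchA(Device):
    def dispatch(self, work):
        return f"ArchA ran {work}"

class ArchB(Device):
    def dispatch(self, work):
        return f"ArchB ran {work}"

def game_frame(device):
    # One code path; the abstraction hides which hardware runs it.
    return device.dispatch("frame")

print(game_frame(ArchA()))  # ArchA ran frame
print(game_frame(ArchB()))  # ArchB ran frame
```

Strip the `Device` layer away and `game_frame` ends up written against one architecture's quirks, which is exactly the situation the DX12 per-architecture code paths above exist to manage.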