Cache architecture is a three-way trade-off among access time, physical size and power consumption, and logical capacity.
The L1 caches are the smallest because they need to be accessed nearly every cycle (the L1 instruction cache, at least). Most modern microarchitectures can access the L1 instruction and data caches in 4-5 cycles. If everything is working as it should, there is a 90%-95% probability that the requested datum is in the appropriate L1 cache and no further penalty is incurred.
The L2 caches are larger and slower. The L2 cache is accessed when the requested datum is not present in the appropriate L1 cache, and the lookup adds another 6-10 cycles of latency depending on the particular microarchitecture. During this time the microprocessor must either stall or find something else to do. The hit rate of the L2 cache compounds with that of the L1 cache, since the L2 only sees the accesses that have already missed in L1.
Dynamically scheduled (out-of-order) microprocessors, which include all modern x86 microprocessors except Intel's Atom, can usually keep themselves busy enough during a miss that an enormous cache is not critical to performance. Statically scheduled (in-order) microprocessors such as Intel's Itanium and IBM's POWER6 (although not other POWER generations) incur an enormous penalty on every miss, so huge caches are more common on these microprocessors.
Where present, the L3 cache is even larger and slower than the L2 cache. Access time for the L3 cache is typically between 30 and 40 CPU cycles. The hit rate of the L3 cache compounds with that of the L2 cache to further improve the total cache hit rate. For example, a microprocessor may have a 90% cumulative hit rate at L1, a 93% cumulative hit rate through L2, and a 95% cumulative hit rate through L3.
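To see what those cumulative hit rates mean in practice, here is a minimal back-of-the-envelope sketch that turns them into an average access latency. The latencies are illustrative values picked from the ranges quoted above (4 cycles for an L1 hit, 12 cycles for an L2 hit, 35 cycles for an L3 hit) plus an assumed 200-cycle trip to main memory; real figures vary by microarchitecture.

```cpp
#include <cstdio>

int main() {
    // Cumulative hit rates from the example above.
    const double hit_l1 = 0.90;   // served by L1
    const double hit_l2 = 0.93;   // served by L1 or L2
    const double hit_l3 = 0.95;   // served by L1, L2, or L3

    // Illustrative latencies in cycles (assumptions, not measured values).
    const double lat_l1  = 4.0;   // L1 hit
    const double lat_l2  = 12.0;  // L1 miss + L2 hit
    const double lat_l3  = 35.0;  // L3 hit
    const double lat_mem = 200.0; // miss everywhere, go to main memory

    // Each level only serves the fraction of accesses that missed every
    // level above it, which is why the cumulative rates compound.
    double avg = hit_l1 * lat_l1
               + (hit_l2 - hit_l1) * lat_l2
               + (hit_l3 - hit_l2) * lat_l3
               + (1.0 - hit_l3) * lat_mem;

    std::printf("average access latency: %.2f cycles\n", avg);  // ~14.7 cycles
    return 0;
}
```

Even with a 95% combined hit rate, the few accesses that fall all the way through to main memory dominate the average, which is why each additional cache level still pays off.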
Since the L3 cache is usually shared by all cores within the same CPU package, it also helps with data synchronization between logical processors. For example, if Core0 loads a datum, that datum's cache block is brought in at every level of the hierarchy, all the way down to Core0's L1 cache. If Core0 then modifies that datum (especially if the modification is atomic), the change must become visible to all other cores. If Core1 also has that cache block loaded, its copy must be invalidated and reloaded. With a shared L3 cache, the updated block can be re-fetched in approximately 30 cycles. Without one, Core1 may have to wait until Core0 writes the block back to main memory and then reload it from there, a penalty on the order of hundreds of cycles.
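The cost of that invalidation traffic is easy to demonstrate. Below is a hypothetical C++ micro-benchmark (the struct layouts, thread count, and iteration count are all assumptions for illustration) in which two threads increment counters that either share one cache line or are padded onto separate lines; when the line is shared, every write forces the other core's copy to be invalidated and re-fetched, exactly the ping-pong described above.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Shared {
    std::atomic<long> a{0};
    std::atomic<long> b{0};              // very likely on the same cache line as `a`
};

struct Padded {
    alignas(64) std::atomic<long> a{0};  // 64-byte cache line assumed
    alignas(64) std::atomic<long> b{0};  // forced onto its own line
};

// Run two threads, each hammering its own counter, and time the whole thing.
template <typename T>
double run(T& counters) {
    constexpr long iters = 10'000'000;
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] {
        for (long i = 0; i < iters; ++i)
            counters.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (long i = 0; i < iters; ++i)
            counters.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
}

int main() {
    Shared s;
    Padded p;
    std::printf("same cache line:      %.3f s\n", run(s));
    std::printf("separate cache lines: %.3f s\n", run(p));
    return 0;
}
```

On a typical multi-core machine the shared-line version runs several times slower, even though the two threads never touch the same variable; all of the extra time is coherence traffic moving the cache block back and forth.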