Simplified version: an Apple GPU core contains four execution units, each of which is 32-wide (it performs an operation on 32 data values in parallel). An instruction in shader program is executed on one of these units. In other words, there are 128 scalar arithmetical units in an Apple GPU core, capable of executing up to four different 32-wide instructions per cycle.
More complicated, but correct version. An Apple GPU core contains multiple execution units of different types. There are also four instruction schedulers which select a shader instruction and send it to an execution unit. Each scheduler controls one 32-wide FP32 unit, one 32-wide FP16 unit, and (presumably, not quite sure) one 16-wide INT32 unit. So in total you have 4x of those units in a core. On M1 and M2 a scheduler can dispatch one instruction to a suitable execution unit per cycle. This means the other units are idling (e.g. it can do either FP32, FP16, or half INT32 operation per cycle). On Apple M3, schedulers are capable of dual issue and can dispatch two instructions per cycle (e.g. one FP32 and one FP16 or INT) assuming appropriate instructions can be found in the instruction stream. This is why M3 can be much faster on complex shaders even though the nominal spec of the GPU didn’t change much.
Each GPU core executes a large number of shader programs in parallel and switches between shaders every cycle, in order to make as much progress as possible. If it can’t find an instruction to execute (for example because all shaders are currently waiting for a texture load), the units have to go idle and thus your performance potential decreases. This is why it’s important to give the GPU as much work as possible, it helps to fill those gaps (the hardware can run some shaders while others are waiting).
Simplified version: an Apple GPU core contains four execution units, each of which is 32-wide (it performs an operation on 32 data values in parallel). An instruction in shader program is executed on one of these units. In other words, there are 128 scalar arithmetical units in an Apple GPU core, capable of executing up to four different 32-wide instructions per cycle.
More complicated, but correct version. An Apple GPU core contains multiple execution units of different types. There are also four instruction schedulers which select a shader instruction and send it to an execution unit. Each scheduler controls one 32-wide FP32 unit, one 32-wide FP16 unit, and (presumably, not quite sure) one 16-wide INT32 unit. So in total you have 4x of those units in a core. On M1 and M2 a scheduler can dispatch one instruction to a suitable execution unit per cycle. This means the other units are idling (e.g. it can do either FP32, FP16, or half INT32 operation per cycle). On Apple M3, schedulers are capable of dual issue and can dispatch two instructions per cycle (e.g. one FP32 and one FP16 or INT) assuming appropriate instructions can be found in the instruction stream. This is why M3 can be much faster on complex shaders even though the nominal spec of the GPU didn’t change much.
Each GPU core executes a large number of shader programs in parallel and switches between shaders every cycle, in order to make as much progress as possible. If it can’t find an instruction to execute (for example because all shaders are currently waiting for a texture load), the units have to go idle and thus your performance potential decreases. This is why it’s important to give the GPU as much work as possible, it helps to fill those gaps (the hardware can run some shaders while others are waiting).