I'm not an SME in chip architecture, but this is how I understand it: the bottleneck is no longer the number of computational steps per clock cycle but how fast the processor can transfer data to and from memory.
A serial connection has one Tx and one Rx channel, and each Tx channel transmits data as a sequential digital signal. Example: you want to transmit the signals A, B, C, D, E. On a serial connection, your computer transmits A, B, C, D, E, in order. If you have more than one channel, you can send more at a time, and the scaling is linear: with 2 channels, you can read from and write to DRAM 2x faster. With 6 channels, for example, it can transmit A1, A2, A3, A4, A5, A6 at once, then B1, B2, B3, B4, B5, B6.
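To make the linear scaling concrete, here is a toy Python sketch. It is my own idealized model, not how a real DRAM controller works: it assumes every channel moves exactly one symbol per cycle and ignores latency, protocol overhead, and everything else. The names `transfer_schedule` and `cycles_needed` are made up for illustration.

```python
from math import ceil


def transfer_schedule(symbols, channels):
    """Split symbols into per-cycle batches, one symbol per channel per cycle.

    Idealized assumption: every channel is always busy and there is no overhead.
    """
    if channels < 1:
        raise ValueError("channels must be at least 1")
    return [symbols[i:i + channels] for i in range(0, len(symbols), channels)]


def cycles_needed(n_symbols, channels):
    """Number of cycles to move n_symbols over the given number of channels."""
    return ceil(n_symbols / channels)


data = ["A", "B", "C", "D", "E", "F"]

# One channel (serial): one symbol per cycle.
for cycle, batch in enumerate(transfer_schedule(data, 1), start=1):
    print("1 channel, cycle", cycle, batch)

# Two channels: two symbols per cycle, so half as many cycles.
for cycle, batch in enumerate(transfer_schedule(data, 2), start=1):
    print("2 channels, cycle", cycle, batch)
```

In this model six symbols take six cycles on one channel and three cycles on two, which is the 2x speedup described above. With a symbol count that does not divide evenly, the last cycle is only partly used.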
How many channels you can add is limited by the form factor of the memory stick.
How about stacking those DRAM dies, integrating the DRAM controller straight into your CPU, and reading off the DRAM controller directly with an arbitrary number of channels? A 288-pin connector is pretty big because it has to deal with PCB traces, but you can easily get thousands of interconnects in an IC package. I'm not an SME, so ask an actual EE for details.