I am not an SME in chip architecture, but this is how I understand it: the bottleneck isn't in the computational steps per clock cycle anymore but in how fast the processor can move data to and from memory.
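As a back-of-envelope illustration (the throughput numbers here are my own assumptions, not measured figures), here is a quick Python sketch comparing the time a streaming AXPY-style loop would spend doing arithmetic versus waiting on DRAM:

```python
# Back-of-envelope sketch (illustrative, assumed numbers):
# compare compute time vs. memory time for y[i] = a*x[i] + y[i].

FLOPS_PER_SEC = 1e12        # assumed: ~1 TFLOP/s of usable compute
DRAM_BYTES_PER_SEC = 100e9  # assumed: ~100 GB/s of memory bandwidth

n = 100_000_000             # vector length
flops = 2 * n               # one multiply + one add per element
bytes_moved = 3 * 8 * n     # read x, read y, write y (8-byte doubles)

compute_time = flops / FLOPS_PER_SEC
memory_time = bytes_moved / DRAM_BYTES_PER_SEC

print(f"compute-bound time: {compute_time*1e3:.2f} ms")
print(f"memory-bound time:  {memory_time*1e3:.2f} ms")
# The memory time dominates by >100x here, i.e. the ALUs mostly sit
# idle waiting for data -- the bottleneck described above.
```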
A serial connection has 1 Tx and 1 Rx channel, with the Tx channel transmitting data as a sequential digital signal. Example: you want to transmit A, B, C, D, E. On a serial connection, your computer transmits A, B, C, D, E, in order. If you have more than 1 channel, you can read/write more per cycle, and the scaling is linear: with 2 channels you can read/write to/from DRAM 2x as fast, and with 6 channels you can transmit e.g. A1, A2, A3, A4, A5, A6 at once, then B1, B2, B3, B4, B5, B6.
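A minimal sketch of that striping, just to make the linear scaling concrete (the channel counts and data labels are illustrative):

```python
# Toy model of channel-count scaling: data words are striped round-robin
# across N channels; each channel moves one word per cycle, so the total
# cycle count scales as ceil(len(data) / N).

from math import ceil

def transmit(data, channels):
    """Return the per-cycle schedule and cycle count for N parallel channels."""
    schedule = [data[i:i + channels] for i in range(0, len(data), channels)]
    return schedule, ceil(len(data) / channels)

data = ["A1", "A2", "A3", "A4", "A5", "A6", "B1", "B2", "B3", "B4", "B5", "B6"]

for n_channels in (1, 2, 6):
    schedule, cycles = transmit(data, n_channels)
    print(f"{n_channels} channel(s): {cycles} cycles -> {schedule}")

# 1 channel : 12 cycles (A1, then A2, ...)
# 2 channels:  6 cycles (A1+A2 together, then A3+A4, ...)
# 6 channels:  2 cycles (A1..A6 together, then B1..B6) -- linear scaling.
```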
Some things that slow it down: the signal from the Intel Xeon Gold has to travel along the traces in the PCB, which introduces a time delay. Traveling outside the chip also exposes the signal to EM noise, such as RF noise from external sources picked up by PCB traces acting as antennas. To remove this noise, you introduce noise-reducing measures like filters, but that makes it just a little bit slower.
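To put a rough number on the travel delay (assuming an FR-4-like dielectric constant of about 4, a typical textbook value, with illustrative trace lengths):

```python
# Rough flight-time sketch for a signal on a PCB trace (assumed numbers).
# In an FR-4-like dielectric the signal travels at roughly c / sqrt(eps_r),
# i.e. about half the speed of light, or ~15 cm per nanosecond.

C = 3e8        # speed of light, m/s
EPS_R = 4.0    # assumed effective dielectric constant of the PCB
velocity = C / EPS_R ** 0.5   # ~1.5e8 m/s

for trace_len_cm in (2, 10, 20):   # CPU-to-DIMM traces are on this order
    delay_ns = (trace_len_cm / 100) / velocity * 1e9
    print(f"{trace_len_cm:>3} cm trace: ~{delay_ns:.2f} ns one-way flight time")

# At multi-GHz transfer rates, even a fraction of a nanosecond is several
# bit periods, so trace length (plus the filtering needed to fight noise)
# eats into the timing budget -- the delay described above.
```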
Now instead of 6 channels along PCB traces, you wafer-bond the DRAM directly to your CPU. Now there's essentially no travel delay and much less noise. But the big one is that you can have an arbitrary number of channels; with a conventional memory stick, the channel count is limited by its form factor.
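A rough bandwidth comparison, using assumed but plausible DDR4-3200-class numbers for the DIMM side and a hypothetical 4096-bit bonded interface (both figures are my assumptions, not from any datasheet):

```python
# Sketch of how aggregate bandwidth scales with interface width.
# A DDR-style channel is 64 bits wide; a stacked / wafer-bonded interface
# can be thousands of bits wide even at a lower signaling rate.

def bandwidth_gb_s(bus_bits, transfers_per_sec):
    return bus_bits / 8 * transfers_per_sec / 1e9

# 6 DIMM channels, 64 bits each, at 3200 MT/s (DDR4-3200-class numbers)
dimm = bandwidth_gb_s(6 * 64, 3.2e9)

# Hypothetical wafer-bonded stack: 4096-bit interface at a modest 2 GT/s
stacked = bandwidth_gb_s(4096, 2.0e9)

print(f"6-channel DIMM interface : ~{dimm:.0f} GB/s")   # ~154 GB/s
print(f"4096-bit bonded interface: ~{stacked:.0f} GB/s")  # ~1024 GB/s
# The wide, short, on-package interface wins even at a lower per-wire rate,
# because the channel count is no longer limited by the stick's form factor.
```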
How about stacking those DRAM dies, integrating the DRAM controller straight into your CPU, and reading off the DRAM controller directly with an arbitrary number of channels? A 288-pin DIMM connector is physically large because each pin has to mate with a PCB trace, but you can get thousands of interconnects in an IC package easily. Even a simple macroscopic ball-grid array (BGA) can get you 2400 pins, so what about lithographically defined interconnects? Maybe 2.88 million?
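A quick sanity check on that last guess, using assumed bond-pad pitches and an assumed 100 mm² overlap area (none of these figures come from the post; the 2.88 million number is its own estimate):

```python
# Density check on the "millions of interconnects" speculation.
# For a regular grid, pad count ~= die area / pitch^2.

def pad_count(die_area_mm2, pitch_um):
    pads_per_mm2 = (1000 / pitch_um) ** 2
    return die_area_mm2 * pads_per_mm2

die_area = 100  # mm^2, assumed CPU/DRAM overlap area

for pitch in (40, 10, 6):   # microbump-class down to hybrid-bond-class pitches (um)
    print(f"{pitch:>3} um pitch over {die_area} mm^2: ~{pad_count(die_area, pitch):,.0f} pads")

# ~62k pads at a 40 um microbump-class pitch, ~1 million at 10 um, and
# ~2.8 million at a 6 um lithographically defined pitch -- the same order
# of magnitude as the 2.88 million guess above.
```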