Basically, an LLM has dozens of transformer layers (each an attention layer plus an FFN layer). We cannot fit the whole model, or even a single transformer layer, in the NPU's SRAM (~1 MB). So we keep the whole model in RAM and use the command stream to transfer one slice of data to the NPU at a time, then the next slice. In total, several times the model size may end up being transferred between DDR and the NPU. This cost is huge, so I still run LLMs purely on the CPU.
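
To give a rough feel for the numbers, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function name `estimate_transfers`, the 1 GiB model size, and the pass count of 3 are hypothetical, and it pretends the whole SRAM is available for weights (in practice activations and buffers need space too, so the real slice count would be higher).

```python
import math

def estimate_transfers(model_bytes: int, sram_bytes: int, passes: int = 1):
    """Estimate DDR -> NPU traffic when weights are streamed slice by slice.

    Assumption: each slice fills the whole SRAM, and every pass over the
    model has to re-stream all of the weights because nothing stays resident.
    Returns (slices_per_pass, total_bytes_moved).
    """
    if model_bytes <= 0 or sram_bytes <= 0 or passes <= 0:
        raise ValueError("all arguments must be positive")
    # Number of SRAM-sized slices needed to cover the model once.
    slices_per_pass = math.ceil(model_bytes / sram_bytes)
    # Total bytes moved is the model size times the number of passes.
    total_bytes = model_bytes * passes
    return slices_per_pass, total_bytes

if __name__ == "__main__":
    # Hypothetical numbers: 1 GiB of weights, 1 MiB of SRAM, 3 passes.
    slices, moved = estimate_transfers(1 << 30, 1 << 20, passes=3)
    print(f"slices per pass: {slices}")
    print(f"total moved: {moved / (1 << 30):.1f} GiB")
```

With these made-up numbers, a single pass already needs on the order of a thousand separate slice transfers, and the total traffic grows linearly with the number of passes over the weights.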