Large Language Model (LLM) inference has two phases: the prompt (or prefill) phase to output the first token, and the extension (or decoding) phase to generate subsequent tokens. In this work, we propose an efficient parallelization scheme, KV-Runahead, to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of the key-value cache (KV-cache). Hence, KV-Runahead parallelizes the prompt phase by orchestrating multiple processes to populate the KV-cache, minimizing the time-to-first-token (TTFT). Dual-purposing the KV-cache scheme has two main benefits. First, since the KV-cache is designed to leverage the causal attention map, we minimize computation and communication automatically. Second, since it already exists for the extension phase, KV-Runahead is easy to implement. We further propose context-level load-balancing to handle uneven KV-cache generation (due to the causal attention) and to optimize TTFT. Compared with existing parallelization schemes such as tensor or sequential parallelization, where keys and values are locally generated and exchanged via all-gather collectives, our experimental results demonstrate that KV-Runahead can deliver over 1.4× and 1.6× speedups for Llama 7B and Falcon 7B, respectively.
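
To make the idea concrete, below is a minimal, single-process sketch of the concept: the prompt is split into context chunks, each (conceptually, one per process) extends the KV-cache it receives from its predecessor under causal attention, and chunk sizes are set by a simple balancing heuristic. All names (`run_chunk`, `balanced_chunks`, the dimensions) are illustrative assumptions, not the paper's implementation; the paper's actual load-balancing and inter-process communication are more sophisticated.

```python
# Conceptual sketch of KV-Runahead-style prefill parallelization (illustrative only).
import numpy as np

D = 64          # head dimension (assumed)
CONTEXT = 1024  # prompt length (assumed)

def run_chunk(x_chunk, kv_cache):
    """One 'process': compute causal attention for its chunk of the prompt,
    reusing the KV-cache handed over by the previous process."""
    k_prev, v_prev = kv_cache
    q = x_chunk                                  # stand-ins for projected Q/K/V
    k = np.concatenate([k_prev, x_chunk], axis=0)
    v = np.concatenate([v_prev, x_chunk], axis=0)
    t_prev, t_new = k_prev.shape[0], x_chunk.shape[0]
    scores = q @ k.T / np.sqrt(D)
    # Causal mask: token i of this chunk may attend to every cached token
    # plus tokens 0..i of the chunk itself.
    mask = np.tril(np.ones((t_new, t_new), dtype=bool))
    full_mask = np.concatenate(
        [np.ones((t_new, t_prev), dtype=bool), mask], axis=1)
    scores = np.where(full_mask, scores, -np.inf)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    out = probs @ v
    return out, (k, v)          # hand the grown KV-cache to the next process

def balanced_chunks(context_len, num_procs):
    """Context-level load balancing (simplified heuristic, not the paper's
    exact partitioning): later chunks attend to more context, so they get
    fewer tokens; we roughly equalize per-chunk attention cost."""
    sizes, used = [], 0
    for p in range(num_procs, 0, -1):
        remaining = context_len - used
        if p == 1:
            size = remaining
        else:
            size = max(1, round(remaining * 2 / (p + 1)))
            size = min(size, remaining - (p - 1))
        sizes.append(size)
        used += size
    return sizes

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((CONTEXT, D))
    sizes = balanced_chunks(CONTEXT, num_procs=4)
    kv = (np.empty((0, D)), np.empty((0, D)))
    start = 0
    for size in sizes:   # in KV-Runahead these chunks run on separate processes
        _, kv = run_chunk(x[start:start + size], kv)
        start += size
    print("chunk sizes:", sizes, "| cached tokens:", kv[0].shape[0])
```

The sketch highlights why dual-purposing the KV-cache helps: each chunk only computes attention against the cache it already holds, so the causal structure trims redundant work, and the hand-off reuses the same cache layout the decoding phase needs anyway.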