Long prompts present a significant challenge for practical LLM-based systems that must operate with low latency and limited resources. We study prompt compression for zero-shot dialogue systems that learn to use unseen APIs directly in-context from their documentation, which may occupy hundreds of prompt tokens per API. We start from a recently introduced approach (Mu et al., 2023) that learns to compress the prompt into a few “gist token” activations during finetuning. However, this simple idea is ineffective for compressing API documentation, resulting in low accuracy compared to the baseline using an uncompressed prompt. In this work, we introduce two major improvements. First, we specialize gist tokens for different hierarchies within an API: we use one Gist_arg token to compress an argument and one Gist_value token to compress an acceptable value of a categorical argument. We then dynamically reveal Gist_value tokens only when they are needed. Second, we add a reconstruction loss to predict the API documentation from the gist tokens. On multiple API-calling tasks, our proposed system retains the simplicity, efficiency, and large compression factor (20x on SGD) of the gist token approach while achieving significantly better accuracy.
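To make the two improvements concrete, the following is a minimal sketch of the hierarchical gist-token layout and the auxiliary reconstruction objective. It is an illustration under stated assumptions, not the paper's actual implementation: the `<GIST_ARG:...>`/`<GIST_VALUE:...>` token strings, the `Argument` type, and the loss weight are all hypothetical.

```python
# Sketch only: hierarchical gist tokens with dynamic revealing, plus a
# weighted reconstruction loss. All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Argument:
    name: str
    doc: str                                          # documentation text to compress
    values: list[str] = field(default_factory=list)   # categorical values, if any

def build_gisted_prompt(api_args: list[Argument], active_arg: str | None) -> list[str]:
    """One Gist_arg token per argument; the Gist_value tokens of a
    categorical argument are revealed only when that argument is active."""
    tokens: list[str] = []
    for arg in api_args:
        tokens.append(f"<GIST_ARG:{arg.name}>")        # compresses arg.doc
        if arg.name == active_arg:                     # dynamic revealing
            tokens += [f"<GIST_VALUE:{arg.name}={v}>" for v in arg.values]
    return tokens

def total_loss(task_loss: float, reconstruction_loss: float, weight: float = 1.0) -> float:
    # Auxiliary objective: reconstruct the API documentation from the
    # gist-token activations (the weighting is an assumed hyperparameter).
    return task_loss + weight * reconstruction_loss
```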