Though Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves a 22% relative reduction in equal error rate (EER) over the text-only approach and attains performance parity with its full fine-tuning (FFT) counterpart while needing to tune only a fraction of its parameters. Furthermore, with the newly introduced adapter dropout, FLoRA is robust to missing data, improving over FFT with 20% lower EER and 56% lower false accept rate. The proposed approach scales well for model sizes from 16M to 3B parameters.
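The sketch below illustrates the general shape of the idea: a frozen linear layer from a pre-trained text LLM is augmented with a trainable low-rank branch that injects features from a new modality, and the whole adapter branch is randomly dropped during training so the model remains usable when that modality is missing. This is a minimal PyTorch sketch under our own assumptions; the class and parameter names (`FusionLoRALinear`, `rank`, `adapter_dropout`) are illustrative and not taken from the paper's implementation.

```python
# Minimal, hypothetical sketch of a FLoRA-style low-rank fusion adapter
# with adapter dropout. Not the authors' code; shapes and names are assumed.
from typing import Optional

import torch
import torch.nn as nn


class FusionLoRALinear(nn.Module):
    """Wraps a frozen pre-trained linear layer with a trainable low-rank
    branch that fuses features from a new modality (e.g., audio)."""

    def __init__(self, base: nn.Linear, modality_dim: int, rank: int = 8,
                 adapter_dropout: float = 0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep the unimodal LLM frozen
            p.requires_grad = False
        # Low-rank path: modality features -> rank -> LLM hidden size.
        self.down = nn.Linear(modality_dim, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op
        self.p_drop = adapter_dropout

    def forward(self, x: torch.Tensor,
                modality: Optional[torch.Tensor]) -> torch.Tensor:
        out = self.base(x)
        if modality is None:
            return out  # modality absent: fall back to the text-only path
        # Adapter dropout: during training, occasionally drop the entire
        # adapter branch so the model stays robust to missing modalities.
        if self.training and torch.rand(()).item() < self.p_drop:
            return out
        return out + self.up(self.down(modality))


# Usage: wrap one projection of a (toy) frozen backbone and train only
# the low-rank branch, i.e., a small fraction of total parameters.
layer = FusionLoRALinear(nn.Linear(512, 512), modality_dim=256, rank=8)
text_h = torch.randn(2, 10, 512)      # hidden states from the text LLM
audio_h = torch.randn(2, 10, 256)     # aligned audio features (assumed)
fused = layer(text_h, audio_h)        # (2, 10, 512)
```

Zero-initializing the up-projection means the adapted model starts out exactly equal to the pre-trained text-only model, so fusion capacity is learned gradually rather than disrupting pre-trained behavior.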