Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant and side conversations or background speech. State-of-the-art DDSD systems use verbal cues (for instance, acoustic, textual, and/or automatic speech recognition (ASR) features) to classify speech as device-directed or otherwise, and often have to contend with one or more of these modalities being unavailable when deployed in real-world settings. In this paper, we investigate fusion schemes for DDSD systems that can be made more robust to missing modalities. Concurrently, we study the use of non-verbal cues, specifically prosody features, in addition to verbal cues for DDSD. We present different approaches to combining scores and embeddings from prosody with the corresponding verbal cues, finding that prosody improves DDSD performance by up to 8.5% in terms of false acceptance rate (FA) at a fixed operating point via non-linear intermediate fusion, while our use of modality dropout techniques improves the performance of these models by 7.4% in terms of FA when evaluated with missing modalities at inference time.
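To make the two ideas above concrete, the sketch below shows one way to implement non-linear intermediate fusion of prosody and verbal embeddings together with modality dropout during training. All layer sizes, the embedding dimensions, the dropout probability, and the class name are illustrative assumptions; this is a minimal sketch, not a reproduction of the architecture described in the paper.

```python
import torch
import torch.nn as nn


class IntermediateFusionDDSD(nn.Module):
    """Minimal sketch: non-linear intermediate fusion for DDSD.

    Acoustic, text/ASR, and prosody embeddings are concatenated and
    passed through a small MLP that outputs a single device-directed
    logit. Dimensions are assumptions for illustration only.
    """

    def __init__(self, acoustic_dim=256, text_dim=128, prosody_dim=64, hidden=128):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(acoustic_dim + text_dim + prosody_dim, hidden),
            nn.ReLU(),  # non-linearity makes this "non-linear" fusion
            nn.Linear(hidden, 1),  # logit: device-directed vs. not
        )

    def forward(self, acoustic, text, prosody, p_drop=0.25):
        embs = [acoustic, text, prosody]
        if self.training:
            # Modality dropout: randomly replace an entire modality
            # embedding with zeros during training, so the model learns
            # to cope with that modality being missing at inference.
            embs = [
                torch.zeros_like(e) if torch.rand(()) < p_drop else e
                for e in embs
            ]
        fused = torch.cat(embs, dim=-1)
        return self.fusion(fused)


# Usage with random stand-in features (batch of 4 utterances):
model = IntermediateFusionDDSD()
model.train()
logit = model(torch.randn(4, 256), torch.randn(4, 128), torch.randn(4, 64))

# At inference, a missing modality can simply be passed as zeros:
model.eval()
logit = model(torch.randn(4, 256), torch.zeros(4, 128), torch.randn(4, 64))
```

In this sketch each modality is dropped independently, so in rare cases all three could be zeroed in one step; a practical variant might re-sample or always keep at least one modality.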