Voice assistants help users make phone calls, send messages, create events, navigate, and do much more. However, assistants have limited capacity to understand their users’ context. In this work, we aim to take a step in this direction. Our work dives into a new experience for users to refer to phone numbers, addresses, email addresses, URLs, and dates on their phone screens. Our focus lies in reference understanding, which becomes particularly interesting when multiple similar texts are present on screen, similar to visual grounding. We collect a dataset and propose a lightweight general-purpose model for this novel experience. Due to the high cost of consuming pixels directly, our system is designed to rely on the extracted text from the UI. Our model is modular, thus offering flexibility, improved interpretability, and efficient runtime memory utilization.