This paper was accepted at the "How Far Are We from AGI?" workshop at ICLR 2024.
Vision-Language Models (VLMs) such as GPT-4V have recently demonstrated incredible strides on diverse vision-language tasks. We dig into vision-based deductive reasoning, a more sophisticated but less explored realm, and find previously unexposed blindspots in the current SOTA VLMs. Specifically, we leverage Raven's Progressive Matrices (RPMs) to assess VLMs' abilities to perform multi-hop relational and deductive reasoning relying solely on visual clues. We perform comprehensive evaluations of several popular VLMs employing standard strategies such as in-context learning, self-consistency, and Chain-of-Thoughts (CoT) on three diverse datasets, including the Mensa IQ test, IntelligenceTest, and RAVEN. The results reveal that despite the impressive capabilities of LLMs in text-based reasoning, we are still far from achieving comparable proficiency in visual deductive reasoning. We found that certain standard strategies that are effective when applied to LLMs do not seamlessly translate to the challenges presented by visual reasoning tasks. Moreover, a detailed analysis reveals that VLMs struggle to solve these tasks mainly because they are unable to perceive and comprehend multiple, confounding abstract patterns in RPM examples.