Haotong Qin, Ge-Peng Ji, Salman Khan, Deng-Ping Fan, Fahad Shahbaz Khan, Luc Van Gool. How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges. Machine Intelligence Research, vol. 20, no. 5, pp. 605–613, 2023. https://doi.org/10.1007/s11633-023-1469-x

How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges

DOI: 10.1007/s11633-023-1469-x
More Information
  • Author Bios:

    Haotong Qin received the B.Eng. degree in computer science and engineering from Beihang University, China. He is a Ph.D. candidate at Beihang University, China, and a visiting scholar at ETH Zürich, Switzerland. His research interests include hardware-friendly deep learning and neural network quantization. His research goal is to enable state-of-the-art neural network models to be deployed on resource-limited hardware, including compression and acceleration for multiple architectures and flexible, efficient deployment across diverse hardware. E-mail: qinhaotong@gmail.com. ORCID iD: 0000-0001-7391-7539

    Ge-Peng Ji received the M.Sc. degree in communication and information systems from Wuhan University, China in 2021. He is currently a Ph.D. candidate at the College of Engineering, Computing & Cybernetics (CECC), Australian National University (ANU), Australia. His research interests lie in designing deep neural networks and applying deep learning to various fields of low-level vision, such as RGB salient object detection, RGB-D salient object detection, video object segmentation, concealed object detection, and medical image segmentation. E-mail: gepengai.ji@gmail.com. ORCID iD: 0000-0001-7092-2877

    Salman Khan is an associate professor of computer vision at Mohamed bin Zayed University of Artificial Intelligence, UAE. He has been actively working on learning from limited data (zero- and few-shot learning), adversarial robustness of deep neural networks, and continual life-long learning systems for computer vision problems. His research interests include computer vision and machine learning. E-mail: salman.khan@mbzuai.ac.ae. ORCID iD: 0000-0002-9502-1749

    Deng-Ping Fan received the Ph.D. degree from Nankai University, China in 2019. He joined ETH Zürich, Switzerland in 2022. He has published about 25 papers in top journals and conferences such as TPAMI, CVPR, ICCV, and ECCV. He won the Best Paper Finalist Award at IEEE CVPR 2019 and the Best Paper Award Nominee at IEEE CVPR 2020. His research interests include computer vision and visual attention, especially RGB salient object detection (SOD), RGB-D SOD, video SOD, and CoSOD. E-mail: dengpfan@gmail.com (Corresponding author). ORCID iD: 0000-0002-5245-7518

    Fahad Shahbaz Khan is currently a full professor and deputy department chair of computer vision at Mohamed bin Zayed University of Artificial Intelligence, UAE. He received the Best Paper Award in the computer vision track at IEEE ICPR 2016. He has published over 100 peer-reviewed conference papers, journal articles, and book contributions, with over 30,000 citations according to Google Scholar. He serves as a regular senior program committee member for leading conferences such as CVPR, ICCV, and ECCV. His research interests span a wide range of topics within computer vision and machine learning. E-mail: fahad.khan@mbzuai.ac.ae. ORCID iD: 0000-0002-4263-3143

    Luc Van Gool received the B.Eng. degree in electromechanical engineering from the Katholieke Universiteit Leuven, Belgium in 1981. He is currently a professor at the Katholieke Universiteit Leuven, Belgium and at ETH Zürich, Switzerland, where he leads computer vision research and teaches at both institutions. He has served on the program committees of several major computer vision conferences. He has received several Best Paper Awards, won the David Marr Prize and a Koenderink Award, and was nominated Distinguished Researcher by the IEEE Computer Science Committee. He is a co-founder of 10 spin-off companies. His research interests include 3D reconstruction and modeling, object recognition, tracking, gesture analysis, and combinations of these. E-mail: vangool@vision.ee.ethz.ch. ORCID iD: 0000-0002-3445-5711

  • Corresponding author: Deng-Ping Fan (dengpfan@gmail.com)
  • Received Date: 2023-08-07
  • Accepted Date: 2023-08-16
  • Published Online: 2023-08-30
  • Publication Date: 2023-10-01
  • Abstract: Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard's impressive track record on textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned on text questions. This exploration holds the potential to unveil new insights and challenges for Bard and other forthcoming multi-modal generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study, we focus on 15 diverse task scenarios encompassing regular, camouflaged, medical, underwater and remote sensing data to comprehensively evaluate Bard's performance. Our primary finding indicates that Bard still struggles in these vision scenarios, highlighting the significant gap in vision-based understanding that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, leading to enhanced capabilities in comprehending and interpreting fine-grained visual data. Our project is released at https://github.com/htqin/GoogleBard-VisUnderstand.
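
    The abstract outlines a prompt-conditioned evaluation: each test image from the 15 task scenarios is paired with a text question, and the model's free-form answer is then judged for visual-understanding quality. A minimal sketch of such a loop follows; it is illustrative only, since Bard offered no public vision API at the time of the study, so `query_multimodal_model`, the directory layout, and the prompt are hypothetical stand-ins rather than the authors' actual tooling.

    ```python
    # Minimal sketch of the image-plus-prompt evaluation loop described in the
    # abstract. All names here are illustrative assumptions: Bard exposed no
    # public vision API at the time of the study, so `query_multimodal_model`
    # is a hypothetical stand-in for however the chat interface is driven, and
    # the scenario directories mirror the paper's five data groupings.
    import json
    from pathlib import Path

    SCENARIOS = {
        "regular": Path("data/regular"),
        "camouflaged": Path("data/camouflaged"),
        "medical": Path("data/medical"),
        "underwater": Path("data/underwater"),
        "remote_sensing": Path("data/remote_sensing"),
    }

    def query_multimodal_model(image_path: Path, prompt: str) -> str:
        """Hypothetical stand-in: submit one image together with a text prompt
        and return the model's free-form textual answer."""
        raise NotImplementedError("Wire this up to the chat interface under test.")

    def run_evaluation(prompt: str, out_file: str = "responses.jsonl") -> None:
        """Query the model once per image and log one JSON record per response
        so that answers can be inspected (e.g., manually graded) afterwards."""
        with open(out_file, "w", encoding="utf-8") as out:
            for scenario, image_dir in SCENARIOS.items():
                for image_path in sorted(image_dir.glob("*.jpg")):  # file extension is an assumption
                    response = query_multimodal_model(image_path, prompt)
                    record = {
                        "scenario": scenario,
                        "image": str(image_path),
                        "prompt": prompt,
                        "response": response,
                    }
                    out.write(json.dumps(record) + "\n")

    if __name__ == "__main__":
        run_evaluation("Describe the most salient object in this image.")
    ```

    Logging responses as JSON Lines keeps each (scenario, image, prompt, answer) tuple self-contained, which makes later manual grading or per-scenario aggregation straightforward.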

