Qi Zheng, Chao-Yue Wang, Dadong Wang, Da-Cheng Tao. Visual Superordinate Abstraction for Robust Concept Learning. Machine Intelligence Research, vol. 20, no. 1, pp. 79–91, 2023. https://doi.org/10.1007/s11633-022-1360-1

Visual Superordinate Abstraction for Robust Concept Learning

doi: 10.1007/s11633-022-1360-1
More Information
  • Author Bio:

    Qi Zheng received the B.Eng. and M.Phil. degrees in electronic information engineering from Huazhong University of Science and Technology, China, in 2016 and 2019, respectively. She is currently a Ph.D. candidate in computer science at the University of Sydney, Australia. Her research interests include multi-modal learning and scene understanding. E-mail: qzhe6525@uni.sydney.edu.au (Corresponding author). ORCID iD: 0000-0002-4351-9537

    Chao-Yue Wang received the B.Eng. degree in information engineering from Tianjin University, China, in 2014, and the Ph.D. degree in information technology from University of Technology Sydney, Australia, in 2018. He was a postdoctoral researcher in machine learning and computer vision at the School of Computer Science, University of Sydney, Australia. He is currently a research scientist at JD Explore Academy (JD.com). His research has been published in prestigious journals and at prominent conferences, such as IEEE T-PAMI, IEEE T-EVC, IEEE T-IP, NeurIPS, CVPR, ECCV, and IJCAI. He received the Distinguished Student Paper Award at the 2017 International Joint Conference on Artificial Intelligence (IJCAI-17). His research interests include developing deep learning techniques to solve real-world challenges, such as image synthesis/editing, controllable video generation, image/video enhancement, and medical image processing. E-mail: chaoyue.wang@sydney.edu.au

    Dadong Wang received the B.Eng. degree in mechanical engineering and the M.Eng. and Ph.D. degrees in AI in machine fault diagnosis from University of Science and Technology, China, in 1990, 1993, and 1997, respectively. He then received a Ph.D. degree in AI in process optimization from University of Wollongong, Australia, in 2002. He is a principal research scientist and the leader of the Commonwealth Scientific and Industrial Research Organisation (CSIRO) Quantitative Imaging Research Team, part of CSIRO Data61, a conjoint professor at the University of New South Wales (UNSW), and an adjunct professor at the University of Technology, Sydney (UTS). Prior to joining CSIRO in 2005, he worked for two multinational companies for six years, developing large intelligent systems for monitoring and control. He has published over 150 research papers, book chapters and reports. His research team has received Research Achievement Awards from CSIRO, the Engineering Excellence Award from Engineers Australia, and the R&D category of the NSW, Queensland and ACT iAwards. He has been developing automated image analysis solutions for scientific and industrial applications, with the aim of increasing both the quality and quantity of information extracted from multi-dimensional image data. His research interests include image analysis, computer vision, artificial intelligence, signal processing and software engineering. E-mail: dadong.wang@csiro.au

    Da-Cheng Tao received the B.Eng. degree in electronic information engineering from University of Science and Technology of China in 2002, the M.Phil. degree in information engineering from the Chinese University of Hong Kong, China, in 2004, and the Ph.D. degree in computer science and information systems from University of London, UK, in 2007. He is a professor of computer science and an ARC laureate fellow with the School of Computer Science and the Faculty of Engineering, University of Sydney, Australia. He mainly applies statistics and mathematics to artificial intelligence and data science. His research is detailed in one monograph and over 200 publications in prestigious journals and proceedings of prominent conferences, such as IEEE TPAMI, TIP, TNNLS, IJCV, JMLR, NIPS, ICML, CVPR, ICCV, ECCV, AAAI, IJCAI, ICDM and ACM SIGKDD, with several best paper awards, including the Best Theory/Algorithm Paper Runner Up Award at IEEE ICDM'07, the Distinguished Paper Award at 2018 IJCAI, the 2014 ICDM 10-year Highest-Impact Paper Award, and the 2017 IEEE Signal Processing Society Best Paper Award. He received the 2015 Australian Scopus-Eureka Prize and the 2018 IEEE ICDM Research Contributions Award. He is a fellow of the Australian Academy of Science, AAAS, ACM and IEEE. His research interests include applying statistics and mathematics to artificial intelligence and data science. E-mail: dacheng.tao@sydney.edu.au. ORCID iD: 0000-0001-7225-5449

  • Received Date: 2022-05-29
  • Accepted Date: 2022-07-21
  • Abstract: Concept learning constructs visual representations that are connected to linguistic semantics, which is fundamental to vision-language tasks. Although promising progress has been made, existing concept learners remain vulnerable to attribute perturbations and out-of-distribution compositions during inference. We ascribe this bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts, e.g., {red, blue, ...} $\in$ the “color” subspace, whereas cube $\in$ the “shape” subspace. In this paper, we propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces (i.e., visual superordinates). Using only natural visual question answering data, our model first acquires the semantic hierarchy from a linguistic view and then explores mutually exclusive visual superordinates under the guidance of the linguistic hierarchy. In addition, a quasi-center visual concept clustering scheme and a superordinate shortcut learning scheme are proposed to enhance the discrimination and independence of concepts within each visual superordinate. Experiments demonstrate the superiority of the proposed framework under diverse settings: it yields relative improvements in overall answering accuracy of 7.5% for reasoning with perturbations and 15.6% for compositional generalization tests.

     

    Figures (11) / Tables (3)
