Available online
doi: 10.1007/s11633-022-1395-3
Abstract:
With the application of mobile communication technology in the automotive industry, intelligent connected vehicles equipped with communication and sensing devices have been rapidly promoted. The road and traffic information perceived by intelligent vehicles has important potential application value, especially for improving vehicle energy saving and driving safety as well as efficient traffic operation. Therefore, a type of vehicle control technology called predictive cruise control (PCC) has become a hot research topic. PCC fully exploits perceived or predicted environmental information to control vehicle cruising predictively and improve the overall performance of the vehicle-road system. Most existing reviews focus on the economical driving of vehicles, but few scholars have conducted a comprehensive survey of PCC from theory to the current state of practice. In this paper, the methods and advances of PCC technologies are reviewed comprehensively by investigating the global literature, and typical applications under a cloud control system (CCS) are proposed. Firstly, the methodology of PCC is introduced in general. Then, according to typical scenarios, PCC-related research is surveyed in depth, covering freeway and urban traffic scenarios involving traditional vehicles, new energy vehicles, intelligent vehicles, and multi-vehicle platoons. Finally, the general architecture and three typical applications of the CCS to PCC are briefly introduced, and future trends of PCC are discussed.
Available online
doi: 10.1007/s11633-022-1382-8
Abstract:
Medical data refers to health-related information associated with regular patient care or collected as part of a clinical trial program. There are many categories of such data, such as clinical imaging data, bio-signal data, electronic health records (EHR), and multi-modality medical data. With the development of deep neural networks over the last decade, the emerging pre-training paradigm has become dominant, as it significantly improves the performance of machine learning methods in data-limited scenarios. In recent years, studies of pre-training in the medical domain have achieved significant progress. To summarize these advances, this work provides a comprehensive survey of recent pre-training methods for several major types of medical data. In this survey, we summarize a large number of related publications and the existing benchmarks in the medical domain. In particular, the survey describes how pre-training methods are applied to or developed for medical data. From a data-driven perspective, we examine the extensive use of pre-training in many medical scenarios. Moreover, based on this summary of recent pre-training studies, we identify several challenges in this field to provide insights for future studies.
Available online
doi: 10.1007/s11633-022-1364-x
Abstract:
Lung cancer is the leading cause of cancer-related deaths worldwide. Medical imaging technologies such as computed tomography (CT) and positron emission tomography (PET) are routinely used for non-invasive lung cancer diagnosis. In clinical practice, physicians investigate the characteristics of tumors such as the size, shape and location from CT and PET images to make decisions. Recently, scientists have proposed various computational image features that can capture more information than that directly perceivable by human eyes, which promotes the rise of radiomics. Radiomics is a research field on the conversion of medical images into high-dimensional features with data-driven methods to help subsequent data mining for better clinical decision support. Radiomic analysis has four major steps: image preprocessing, tumor segmentation, feature extraction and clinical prediction. Machine learning, including the high-profile deep learning, facilitates the development and application of radiomic methods. Various radiomic methods have been proposed recently, such as the construction of radiomic signatures, tumor habitat analysis, cluster pattern characterization and end-to-end prediction of tumor properties. These methods have been applied in many studies aiming at lung cancer diagnosis, treatment and monitoring, shedding light on future non-invasive evaluations of the nodule malignancy, histological subtypes, genomic properties and treatment responses. In this review, we summarized and categorized the studies on the general workflow, methods for clinical prediction and clinical applications of machine learning in lung cancer radiomic studies, introduced some commonly-used software tools, and discussed the limitations of current methods and possible future directions.
Available online
doi: 10.1007/s11633-022-1347-y
Abstract:
Dialogue policy learning (DPL) is a key component in a task-oriented dialogue (TOD) system. Its goal is to decide the next action of the dialogue system, given the dialogue state at each turn based on a learned dialogue policy. Reinforcement learning (RL) is widely used to optimize this dialogue policy. In the learning process, the user is regarded as the environment and the system as the agent. In this paper, we present an overview of the recent advances and challenges in dialogue policy from the perspective of RL. More specifically, we identify the problems and summarize corresponding solutions for RL-based dialogue policy learning. In addition, we provide a comprehensive survey of applying RL to DPL by categorizing recent methods into five basic elements in RL. We believe this survey can shed light on future research in DPL.
Available online
doi: 10.1007/s11633-022-1384-6
Abstract:
With the breakthrough of AlphaGo, human-computer gaming AI has experienced a rapid expansion, attracting more and more researchers worldwide. As a recognized benchmark for testing artificial intelligence, various human-computer gaming AI systems (AIs) have been developed, such as Libratus, OpenAI Five, and AlphaStar, which beat professional human players. The rapid development of human-computer gaming AIs indicates a big step forward for decision-making intelligence, and it seems that current techniques can handle very complex human-computer games. A natural question therefore arises: What are the remaining challenges for current techniques in human-computer gaming, and what are the future trends? To answer this question, in this paper, we survey recent successful game AIs, covering board game AIs, card game AIs, first-person shooting game AIs, and real-time strategy game AIs. Through this survey, we 1) compare the main difficulties among different kinds of games and the corresponding techniques used to achieve professional human-level AIs; 2) summarize the mainstream frameworks and techniques that can be relied on for developing AIs for complex human-computer games; 3) identify the challenges and drawbacks of current techniques in these successful AIs; and 4) point out future trends in human-computer gaming AIs. Finally, we hope this brief review can provide an introduction for beginners and inspire insights for researchers in the field of AI in human-computer gaming.
Available online
doi: 10.1007/s11633-022-1409-1
Abstract:
Instructional videos, which naturally contain abundant clip-narration pairs, are very useful for completing complex daily tasks. Existing works on procedure understanding typically pre-train various video-language models with these pairs and then fine-tune downstream classifiers and localizers in a predetermined category space. These video-language models are proficient at representing short-term actions, basic objects, and their combinations, but they are still far from understanding long-term procedures. In addition, the predetermined procedure categories face a combinatorial explosion and are inherently unsuited to unseen procedures. Therefore, we propose a novel compositional prompt learning (CPL) framework to understand long-term procedures by prompting short-term video-language models and reformulating several classical procedure understanding tasks into general video-text matching problems. Specifically, the proposed CPL consists of one visual prompt and three compositional textual prompts (the action prompt, object prompt, and procedure prompt), which compositionally distill knowledge from short-term video-language models to facilitate long-term procedure understanding. Besides, the task reformulation enables our CPL to perform well in zero-shot, few-shot, and fully-supervised settings. Extensive experiments on two widely-used datasets for procedure understanding demonstrate the effectiveness of the proposed approach.
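To make the video-text matching reformulation above concrete, the following minimal Python sketch scores a video embedding against several prompt-composed procedure descriptions and returns the best match. The toy text encoder, embedding dimension, and prompt strings are illustrative placeholders, not the actual CPL prompts or encoders.

```python
import torch
import torch.nn.functional as F

def match_procedure(video_emb, prompt_texts, text_encoder):
    """Reformulate procedure recognition as video-text matching: encode each
    prompt-composed description, compare it with the video embedding by cosine
    similarity, and return the best-matching description."""
    text_embs = torch.stack([text_encoder(t) for t in prompt_texts])       # (K, D)
    sims = F.cosine_similarity(video_emb.unsqueeze(0), text_embs, dim=-1)  # (K,)
    return prompt_texts[int(sims.argmax())], sims

# Toy usage with a hash-based stand-in for a real text encoder (deterministic within one run).
def toy_text_encoder(text, dim=64):
    g = torch.Generator().manual_seed(abs(hash(text)) % (2 ** 31))
    return torch.randn(dim, generator=g)

prompts = ["a video of making coffee", "a video of changing a tire", "a video of jacking up a car"]
video = toy_text_encoder("a video of changing a tire")   # pretend the video encodes like its narration
best, scores = match_procedure(video, prompts, toy_text_encoder)
print(best)
```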
Available online
doi: 10.1007/s11633-022-1363-y
Abstract:
For industrial processes, newly emerging, scarce faults are usually judged by experts. The lack of instances of these faults causes a severe data imbalance problem for a diagnosis model and leads to low performance. In this article, a new few-shot diagnosis method based on a class-rebalance strategy is proposed to handle this problem. The proposed method transforms instances of the different faults into a feature embedding space, so that the fault features form separate feature clusters. The fault representations are calculated as the centers of the feature clusters, and the representations of new faults can be effectively computed from only a few support instances. Therefore, fault diagnosis can be achieved by estimating the feature similarity between instances and faults. A cluster loss function is designed to enhance the feature clustering performance. In addition, a class-rebalance strategy with data augmentation is designed to imitate potential faults with different causes and degrees of severity, improving the model's generalizability and thus the diagnosis performance of the proposed method. Simulations of fault diagnosis with the proposed method were performed on the Tennessee-Eastman benchmark. The proposed method achieved average diagnosis accuracies ranging from 81.8% to 94.7% for the eight selected faults with support-instance settings ranging from 3 to 50. The simulation results verify the effectiveness of the proposed method.
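The prototype-and-similarity idea described above can be illustrated with a short sketch: class representations are computed as the centers of support-instance embeddings, and a query is assigned to the nearest representation. The embeddings, dimensions, and Euclidean distance below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def class_prototypes(embeddings, labels):
    """Compute one prototype per fault class as the mean of its support embeddings."""
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}

def diagnose(query_embedding, prototypes):
    """Assign the fault class whose prototype is closest in the embedding space."""
    classes = list(prototypes)
    dists = [np.linalg.norm(query_embedding - prototypes[c]) for c in classes]
    return classes[int(np.argmin(dists))]

# Toy usage: 3 support instances for each of 2 faults, embedded in a 4-D feature space.
rng = np.random.default_rng(0)
support = np.vstack([rng.normal(0.0, 0.1, (3, 4)), rng.normal(1.0, 0.1, (3, 4))])
support_labels = np.array([0, 0, 0, 1, 1, 1])
protos = class_prototypes(support, support_labels)
print(diagnose(rng.normal(1.0, 0.1, 4), protos))  # expected: 1
```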
Available online
doi: 10.1007/s11633-022-1358-8
Abstract:
Few-shot learning (FSL) aims to learn novel concepts from very limited examples. However, most FSL methods lack robustness in concept learning. Specifically, existing FSL methods usually ignore the diversity of region contents, which may contain concept-irrelevant information such as the background; this introduces bias/noise and degrades the performance of conceptual representation learning. To address this issue, we propose a novel metric-based FSL method termed region-adaptive concept aggregation network, or RCA-Net. Specifically, we devise a region-adaptive concept aggregator (RCA) to model the relationships among different regions and capture the conceptual information in each region, which is then integrated in a weighted-average manner to obtain the conceptual representation. Consequently, robust concept learning is achieved by focusing more on concept-relevant information and less on concept-irrelevant information. We perform extensive experiments on three popular visual recognition benchmarks to demonstrate the superiority of RCA-Net for robust few-shot learning. In particular, on the Caltech-UCSD Birds-200-2011 (CUB200) dataset, the proposed RCA-Net significantly improves 1-shot accuracy from 74.76% to 78.03% and 5-shot accuracy from 86.84% to 89.83% compared with the most competitive counterpart.
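A minimal sketch of the weighted-average aggregation step reads as follows; the scoring function (a scaled dot product against a concept query) and the dimensions are assumptions for illustration, not the exact RCA design.

```python
import torch
import torch.nn.functional as F

def aggregate_regions(region_feats, concept_query):
    """
    region_feats : (R, D) features from R image regions.
    concept_query: (D,) query vector representing the class concept.
    Returns a single (D,) conceptual representation as a weighted average of regions,
    where the weights reflect each region's relevance to the concept.
    """
    scores = region_feats @ concept_query / region_feats.shape[-1] ** 0.5  # (R,)
    weights = F.softmax(scores, dim=0)        # concept-relevant regions receive larger weights
    return weights @ region_feats             # (D,)

# Toy usage: 5 regions with 16-D features.
feats = torch.randn(5, 16)
query = torch.randn(16)
print(aggregate_regions(feats, query).shape)  # torch.Size([16])
```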
Available online
doi: 10.1007/s11633-022-1377-5
Abstract:
The pre-training-then-fine-tuning paradigm has been widely used in deep learning. Due to the huge computational cost of pre-training, practitioners usually download pre-trained models from the Internet and fine-tune them on downstream datasets, but the downloaded models may suffer from backdoor attacks. Different from previous attacks aimed at a single target task, we show that a backdoored pre-trained model can behave maliciously on various downstream tasks without prior knowledge of the task information. Attackers can restrict the output representations (the values of output neurons) of trigger-embedded samples to arbitrary predefined values through additional training, which we call the neuron-level backdoor attack (NeuBA). Since fine-tuning has little effect on model parameters, the fine-tuned model retains the backdoor functionality and predicts a specific label for samples embedded with the same trigger. To provoke multiple labels in a specific task, attackers can introduce several triggers with predefined contrastive values. In experiments on both natural language processing (NLP) and computer vision (CV), we show that NeuBA can reliably control the predictions for trigger-embedded instances with different trigger designs. Our findings sound a red alarm for the wide use of pre-trained models. Finally, we apply several defense methods to NeuBA and find that model pruning is a promising technique for resisting NeuBA by omitting backdoored neurons.
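The attack objective described above can be sketched roughly as the ordinary pre-training loss on clean inputs plus a term that pulls the output representation of trigger-embedded inputs towards a predefined vector. The trigger, target vector, backbone, and auxiliary loss below are toy placeholders, not the paper's setup.

```python
import torch
import torch.nn.functional as F

def neuba_style_loss(model, clean_batch, trigger, target_vector, pretrain_loss_fn):
    """Keep the normal pre-training loss on clean inputs while forcing the output
    representation of trigger-embedded inputs towards a predefined target vector."""
    clean_loss = pretrain_loss_fn(model, clean_batch)     # ordinary pre-training objective
    poisoned = clean_batch + trigger                      # embed a fixed trigger pattern
    rep = model(poisoned)                                 # output representation of the backbone
    backdoor_loss = F.mse_loss(rep, target_vector.expand_as(rep))
    return clean_loss + backdoor_loss

# Toy usage with a linear "backbone" and a dummy pre-training loss.
backbone = torch.nn.Linear(8, 4)
dummy_loss = lambda m, x: m(x).pow(2).mean()
x = torch.randn(16, 8)
trig = torch.zeros(8)
trig[0] = 1.0                                 # fixed trigger pattern
tgt = torch.ones(4)                           # predefined target representation
print(neuba_style_loss(backbone, x, trig, tgt, dummy_loss))
```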
Available online
doi: 10.1007/s11633-022-1394-4
Abstract:
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize the vision transformer architecture to replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show obvious improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
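As a rough illustration of the masked image reconstruction (MIR) objective, the sketch below masks a random subset of image patches and penalises reconstruction error only on the masked positions; the masking scheme, model, and dimensions are simplified assumptions rather than the MVLT implementation.

```python
import torch
import torch.nn.functional as F

def masked_image_reconstruction_loss(patches, encoder_decoder, mask_ratio=0.5):
    """
    patches        : (N, P, D) batch of N images split into P flattened patches of dim D.
    encoder_decoder: any module mapping (N, P, D) -> (N, P, D) reconstructed patches.
    Randomly masks a fraction of patches and penalises reconstruction error
    only on the masked positions.
    """
    n, p, d = patches.shape
    mask = torch.rand(n, p) < mask_ratio        # True = masked
    corrupted = patches.clone()
    corrupted[mask] = 0.0                       # replace masked patches with zeros
    recon = encoder_decoder(corrupted)
    return F.mse_loss(recon[mask], patches[mask])

# Toy usage with an identity "model".
x = torch.randn(2, 9, 32)
print(masked_image_reconstruction_loss(x, torch.nn.Identity()))
```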
Available online
doi: 10.1007/s11633-022-1387-3
Abstract:
Large-scale pre-training has shown remarkable performance in building open-domain dialogue systems. However, previous works mainly focus on showing and evaluating the conversational performance of the released dialogue models, ignoring the discussion of some key factors for building a powerful human-like chatbot, especially in Chinese scenarios. In this paper, we conduct extensive experiments to investigate these under-explored factors, including data quality control, model architecture design, training approaches, and decoding strategies. We propose EVA2.0, a large-scale pre-trained open-domain Chinese dialogue model with 2.8 billion parameters, and will make our models and code publicly available. Automatic and human evaluations show that EVA2.0 significantly outperforms other open-source counterparts. We also discuss the limitations of this work by presenting some failure cases and pose some future research directions for large-scale Chinese open-domain dialogue systems.
Available online
doi: 10.1007/s11633-022-1374-8
Abstract:
Recent years have witnessed the great success of self-supervised learning (SSL) in recommendation systems. However, SSL recommender models are likely to suffer from spurious correlations, leading to poor generalization. To mitigate spurious correlations, existing work usually pursues ID-based SSL recommendation or utilizes feature engineering to identify spurious features. Nevertheless, ID-based SSL approaches sacrifice the positive impact of invariant features, while feature engineering methods require high-cost human labeling. To address these problems, we aim to automatically mitigate the effect of spurious correlations. This objective requires us to 1) automatically mask spurious features without supervision, and 2) block the negative effect transmission from spurious features to other features during SSL. To handle these two challenges, we propose an invariant feature learning framework, which first divides user-item interactions into multiple environments with distribution shifts and then learns a feature mask mechanism to capture invariant features across environments. Based on the mask mechanism, we can remove the spurious features for robust predictions and block the negative effect transmission via mask-guided feature augmentation. Extensive experiments on two datasets demonstrate the effectiveness of the proposed framework in mitigating spurious correlations and improving the generalization abilities of SSL models.
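A highly simplified sketch of the mask-learning idea is given below: a learnable soft mask gates the features, and a variance penalty over per-environment losses stands in for the cross-environment invariance objective. The scorer, penalty, and data are illustrative assumptions, not the proposed framework itself.

```python
import torch

def environment_invariance_penalty(losses_per_env):
    """Penalise how much the training loss varies across environments, a simple proxy
    for requiring the masked features to work equally well in every environment."""
    return torch.stack(losses_per_env).var(unbiased=False)

class MaskedScorer(torch.nn.Module):
    """Score user-item features through a learnable soft mask; features whose mask
    value shrinks towards 0 are treated as spurious and removed from the prediction."""
    def __init__(self, dim):
        super().__init__()
        self.mask_logits = torch.nn.Parameter(torch.zeros(dim))
        self.w = torch.nn.Linear(dim, 1)

    def forward(self, feats):
        mask = torch.sigmoid(self.mask_logits)   # soft feature mask in (0, 1)
        return self.w(feats * mask).squeeze(-1)

# Toy usage: two environments with shifted feature distributions.
model = MaskedScorer(dim=8)
env_a, env_b = torch.randn(32, 8), torch.randn(32, 8) + 1.0
y_a, y_b = torch.rand(32), torch.rand(32)
loss_a = torch.nn.functional.mse_loss(model(env_a), y_a)
loss_b = torch.nn.functional.mse_loss(model(env_b), y_b)
total = loss_a + loss_b + environment_invariance_penalty([loss_a, loss_b])
total.backward()
```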
Available online
doi: 10.1007/s11633-022-1346-z
Abstract:
In this article, a robot skill learning framework is developed that considers both motion modeling and execution. To enable the robot to learn skills from demonstrations, a learning method called dynamic movement primitives (DMPs) is introduced to model motion. A staged teaching strategy is integrated into the DMP framework to enhance its generality, such that complicated tasks can also be performed by multi-joint manipulators. A DMP connection method is used to make accurate and smooth transitions in position and velocity space when connecting complex motion sequences. In addition, motions are categorized by different goals and durations. Notably, an adaptive neural network (NN) control method is proposed to achieve highly accurate trajectory tracking and to ensure the performance of action execution, which improves the reliability of the skill learning system. Experiments on the Baxter robot verify the effectiveness of the proposed method.
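For reference, the standard discrete DMP transformation system underlying this kind of motion modeling can be integrated with a few lines of Python; the gains, phase decay rate, and zero forcing term below are illustrative defaults, not the parameters used in the paper.

```python
import numpy as np

def dmp_rollout(y0, goal, forcing, tau=1.0, alpha=25.0, beta=6.25, dt=0.001, steps=1000):
    """
    Integrate a standard discrete dynamic movement primitive:
        tau * dv/dt = alpha * (beta * (g - y) - v) + f(s)
        tau * dy/dt = v
    `forcing` is a function of the phase s in (0, 1] that shapes the trajectory;
    with f = 0 the system converges smoothly from y0 to the goal g.
    """
    y, v, s = y0, 0.0, 1.0
    alpha_s = 4.0                                   # canonical-system decay rate
    traj = []
    for _ in range(steps):
        v += dt * (alpha * (beta * (goal - y) - v) + forcing(s)) / tau
        y += dt * v / tau
        s += dt * (-alpha_s * s) / tau              # phase variable decays, avoiding explicit time
        traj.append(y)
    return np.array(traj)

# Toy usage: move from 0 to 1 with no learned forcing term.
print(dmp_rollout(0.0, 1.0, lambda s: 0.0)[-1])     # approaches the goal 1.0
```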
Available online
doi: 10.1007/s11633-022-1373-9
Abstract:
Web search provides a promising way for people to obtain information and has been extensively studied. With the surge of deep learning and large-scale pre-training techniques, various neural information retrieval models have been proposed, and they have demonstrated their power to improve search (especially ranking) quality. All these existing search methods follow a common paradigm, i.e., index-retrieve-rerank: they first build an index of all documents based on document terms (i.e., a sparse inverted index) or representation vectors (i.e., a dense vector index), and then retrieve and rerank documents based on the similarity between the query and documents computed by ranking models. In this paper, we explore a new paradigm of information retrieval without an explicit index, using only a pre-trained model. Instead, all of the knowledge of the documents is encoded into model parameters, which can be regarded as a differentiable indexer and optimized in an end-to-end manner. Specifically, we propose a pre-trained model-based information retrieval (IR) system called DynamicRetriever, which directly returns document identifiers for a given query. Under this framework, we implement two variants to explore how to train the model from scratch and how to combine the advantages of dense retrieval models. Compared with existing search methods, the model-based IR system parameterizes the traditional static index with a pre-trained model, which converts the document semantic mapping into a dynamic and updatable process. Extensive experiments on the public search benchmark Microsoft machine reading comprehension (MS MARCO) verify the effectiveness and potential of our proposed paradigm for information retrieval.
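The "model as index" idea can be sketched as a query encoder followed by a classification head over document identifiers, trained with cross-entropy on (query, docid) pairs and queried with a top-k forward pass. The encoder, vocabulary, and corpus size below are toy assumptions, not the DynamicRetriever architecture.

```python
import torch

class DocIdRetriever(torch.nn.Module):
    """Minimal model-based retrieval sketch: a query encoder followed by a classification
    head whose classes are document identifiers, so retrieval is a forward pass
    rather than a lookup in an external index."""
    def __init__(self, query_encoder, hidden_dim, num_docs):
        super().__init__()
        self.encoder = query_encoder                         # any module: token ids -> (B, hidden_dim)
        self.docid_head = torch.nn.Linear(hidden_dim, num_docs)

    def forward(self, query_tokens):
        return self.docid_head(self.encoder(query_tokens))   # (B, num_docs) docid logits

    @torch.no_grad()
    def retrieve(self, query_tokens, k=10):
        return self.forward(query_tokens).topk(k).indices    # top-k document identifiers

# Toy usage with an embedding-bag encoder over a 1000-word vocabulary and 500 documents.
encoder = torch.nn.Sequential(torch.nn.EmbeddingBag(1000, 64), torch.nn.ReLU())
model = DocIdRetriever(encoder, hidden_dim=64, num_docs=500)
queries = torch.randint(0, 1000, (4, 12))                    # 4 queries of 12 token ids
labels = torch.randint(0, 500, (4,))                         # gold document identifiers
loss = torch.nn.functional.cross_entropy(model(queries), labels)
loss.backward()
print(model.retrieve(queries, k=3).shape)                    # torch.Size([4, 3])
```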
Available online
doi: 10.1007/s11633-022-1372-x
Abstract:
Multimodal sentence summarization (MMSS) is a new yet challenging task that aims to generate a concise summary of a long sentence and its corresponding image. Although existing methods have achieved promising success in MMSS, they overlook the powerful generation ability of generative pre-trained language models (GPLMs), which have been shown to be effective in many text generation tasks. To fill this research gap, we propose using GPLMs to promote the performance of MMSS. Notably, adopting GPLMs to solve MMSS inevitably faces two challenges: 1) What fusion strategy should we use to inject visual information into GPLMs properly? 2) How can we preserve the GPLM's generation ability as much as possible when the visual feature is injected into the GPLM? To address these two challenges, we propose a vision-enhanced generative pre-trained language model for MMSS, dubbed Vision-GPLM. In Vision-GPLM, we obtain features of the visual and textual modalities with two separate encoders and utilize a text decoder to produce the summary. In particular, we utilize multi-head attention to fuse the features extracted from the visual and textual modalities, thereby injecting the visual feature into the GPLM. Meanwhile, we train Vision-GPLM in two stages: a vision-oriented pre-training stage and a fine-tuning stage. In the vision-oriented pre-training stage, we train only the visual encoder with the masked language model task while the other components are frozen, aiming to obtain homogeneous representations of text and image. In the fine-tuning stage, we train all the components of Vision-GPLM on the MMSS task. Extensive experiments on a public MMSS dataset verify the superiority of our model over existing baselines.
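A minimal sketch of multi-head-attention fusion of visual features into a text stream is shown below; the residual-plus-LayerNorm wiring and the dimensions are common choices assumed for illustration, not necessarily the exact Vision-GPLM design.

```python
import torch

class VisionTextFusion(torch.nn.Module):
    """Minimal cross-attention fusion sketch: text states attend to visual features,
    and the attended visual context is added back to the text stream before it
    enters the language-model decoder."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = torch.nn.LayerNorm(dim)

    def forward(self, text_states, visual_feats):
        attended, _ = self.cross_attn(query=text_states, key=visual_feats, value=visual_feats)
        return self.norm(text_states + attended)     # residual fusion keeps the text stream intact

# Toy usage: 4 sentences of 20 tokens fused with 49 visual patch features.
fusion = VisionTextFusion(dim=256)
text = torch.randn(4, 20, 256)
vision = torch.randn(4, 49, 256)
print(fusion(text, vision).shape)                    # torch.Size([4, 20, 256])
```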
Available online
doi: 10.1007/s11633-022-1366-8
Abstract:
Photoplethysmography (PPG) biometrics have received considerable attention. Although deep learning has achieved good performance for PPG biometrics, several challenges remain open: 1) how to effectively extract a fused feature representation from time-domain and frequency-domain PPG signals; 2) how to effectively capture the transition information in a series of PPG signals; and 3) how to extract time-varying information from one-dimensional time-frequency sequential data. To address these challenges, we propose a dual-domain and multiscale fusion deep neural network (DMFDNN) for PPG biometric recognition. The DMFDNN is mainly composed of a two-branch deep learning framework for PPG biometrics, which can learn time-varying and multiscale discriminative features from the time and frequency domains. Meanwhile, we design a multiscale extraction module to capture transition information, consisting of multiple convolution layers with different receptive fields. In addition, a dual-domain attention module is proposed to strengthen whichever of the time-domain and frequency-domain features contributes more to PPG biometrics. Experiments on four datasets demonstrate that DMFDNN outperforms the state-of-the-art methods for PPG biometrics.
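The dual-domain, two-branch idea can be sketched as one convolutional branch over the raw signal and another over its FFT magnitude spectrum, with learned weights balancing the two domains; the layer sizes, pooling, and weighting below are illustrative assumptions rather than the DMFDNN architecture.

```python
import torch

class DualDomainNet(torch.nn.Module):
    """Minimal two-branch sketch: one 1-D conv branch on the raw PPG segment and one
    on its FFT magnitude spectrum, with learned weights deciding how much each
    domain contributes to the fused representation."""
    def __init__(self, num_subjects, channels=16):
        super().__init__()
        self.time_branch = torch.nn.Sequential(
            torch.nn.Conv1d(1, channels, kernel_size=7, padding=3), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool1d(1))
        self.freq_branch = torch.nn.Sequential(
            torch.nn.Conv1d(1, channels, kernel_size=7, padding=3), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool1d(1))
        self.domain_attn = torch.nn.Linear(2 * channels, 2)       # one weight per domain
        self.classifier = torch.nn.Linear(channels, num_subjects)

    def forward(self, signal):                                     # signal: (B, 1, L)
        t = self.time_branch(signal).squeeze(-1)                   # (B, C) time-domain feature
        f = self.freq_branch(torch.fft.rfft(signal, dim=-1).abs()).squeeze(-1)
        w = torch.softmax(self.domain_attn(torch.cat([t, f], dim=-1)), dim=-1)
        fused = w[:, :1] * t + w[:, 1:] * f                        # domain-weighted fusion
        return self.classifier(fused)

# Toy usage: batch of 8 one-second PPG segments sampled at 128 Hz, 20 enrolled subjects.
model = DualDomainNet(num_subjects=20)
print(model(torch.randn(8, 1, 128)).shape)                         # torch.Size([8, 20])
```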
Available online
doi: 10.1007/s11633-022-1341-4
Abstract:
Most finger vein authentication systems suffer from the problem of small sample size. Data augmentation can alleviate this problem to a certain extent but does not fundamentally solve the lack of category diversity. Researchers therefore resort to pre-training or multi-source joint training methods, but these methods risk leaking user privacy. In view of these issues, this paper proposes a federated learning-based finger vein authentication framework (FedFV) to solve the problems of small sample size and category diversity while protecting user privacy. Through training under FedFV, each client can share the knowledge learned from its users' finger vein data with the other federated clients without causing template leaks. In addition, we further propose an efficient personalized federated aggregation algorithm, named federated weighted proportion reduction (FedWPR), to tackle the problem of non-independent and identically distributed (non-IID) data caused by client diversity, thus achieving the best performance for each client. To thoroughly evaluate the effectiveness of FedFV, comprehensive experiments are conducted on nine publicly available finger vein datasets. Experimental results show that FedFV can improve the performance of the finger vein authentication system without directly using other clients' data. To the best of our knowledge, FedFV is the first personalized federated finger vein authentication framework, and it provides a reference for subsequent research on biometric privacy protection.
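Since the abstract does not spell out the FedWPR weighting rule, the sketch below shows only a generic FedAvg-style weighted aggregation of client models (weights proportional to local data size) to illustrate where a personalized weighting scheme would plug in; it is not the FedWPR algorithm itself.

```python
import copy
import torch

def weighted_aggregate(client_states, client_weights):
    """Aggregate client state_dicts into one global state_dict using per-client
    weights (here proportional to local data size, a common FedAvg-style choice)."""
    total = sum(client_weights)
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(w * s[key] for w, s in zip(client_weights, client_states)) / total
    return global_state

# Toy usage: three clients fine-tune copies of the same small model on local data.
template = torch.nn.Linear(4, 2)
clients = [copy.deepcopy(template) for _ in range(3)]
for c in clients:                                       # pretend local training happened
    with torch.no_grad():
        for p in c.parameters():
            p.add_(torch.randn_like(p) * 0.01)
new_global = weighted_aggregate([c.state_dict() for c in clients], client_weights=[120, 300, 80])
template.load_state_dict(new_global)
```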
Available online
doi: 10.1007/s11633-022-1345-0
Abstract:
Electrocardiogram (ECG) biometric recognition has gained considerable attention, and various methods have been proposed to facilitate its development. However, one limitation is that the diversity of ECG signals affects the recognition performance. To address this issue, in this paper, we propose a novel ECG biometrics framework based on enhanced correlation and semantic-rich embedding. Firstly, we construct an enhanced correlation between the base feature and latent representation by using only one projection. Secondly, to fully exploit the semantic information, we take both the label and pairwise similarity into consideration to reduce the influence of ECG sample diversity. Furthermore, to solve the objective function, we propose an effective and efficient algorithm for optimization. Finally, extensive experiments are conducted on two benchmark datasets, and the experimental results show the effectiveness of our framework.
Available online
doi: 10.1007/s11633-022-1328-1
Abstract:
Adversarial examples are a well-known serious threat to deep neural networks (DNNs). In this work, we study the detection of adversarial examples based on the assumption that the output and internal responses of a DNN model for both adversarial and benign examples follow the generalized Gaussian distribution (GGD), but with different parameters (i.e., shape factor, mean, and variance). GGD is a general distribution family that covers many popular distributions (e.g., Laplacian, Gaussian, or uniform), so it is more likely to approximate the intrinsic distributions of internal responses than any specific distribution. Moreover, since the shape factor is more robust across different databases than the other two parameters, we propose to construct discriminative features from the shape factor for adversarial detection, employing the magnitude of Benford-Fourier (MBF) coefficients, which can be easily estimated from the responses. Finally, a support vector machine is trained as an adversarial detector on the MBF features. Extensive experiments on image classification demonstrate that the proposed detector is much more effective and robust in detecting adversarial examples from different crafting methods and sources than state-of-the-art adversarial detection methods.
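As a simplified stand-in for the MBF-based shape-factor estimation (whose exact computation is not given in the abstract), the sketch below fits a generalized Gaussian directly to each layer's responses with scipy, keeps the shape factor as the feature, and trains an SVM detector on it; the data and layer structure are toy placeholders.

```python
import numpy as np
from scipy.stats import gennorm
from sklearn.svm import SVC

def shape_factor_features(layer_responses):
    """Fit a generalized Gaussian to each layer's responses and keep only the
    shape factor (beta), yielding one small feature vector per input sample."""
    return np.array([gennorm.fit(r)[0] for r in layer_responses])

# Toy usage: benign responses look roughly Gaussian (beta ~ 2), "adversarial" ones heavier-tailed.
rng = np.random.default_rng(0)
benign = [[rng.normal(size=256) for _ in range(3)] for _ in range(40)]   # 40 samples x 3 layers
advers = [[rng.laplace(size=256) for _ in range(3)] for _ in range(40)]
X = np.array([shape_factor_features(s) for s in benign + advers])
y = np.array([0] * 40 + [1] * 40)
detector = SVC(kernel="rbf").fit(X, y)
print(detector.score(X, y))
```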
2023, vol. 20, no. 1, pp. 1-18,
doi: 10.1007/s11633-022-1390-8
Abstract:
Traditional joint-link robots have been widely used in production lines because of their high precision in single tasks. With the development of the manufacturing and service industries, the requirements on the comprehensive performance of robots are growing. Numerous types of bio-inspired robots have been investigated to realize human-like motion control and manipulation. A study route from inner mechanisms to external structures is proposed to better imitate humans and animals. With this idea, a brain-inspired intelligent robotic system is constructed that contains visual cognition, decision-making, motion control, and musculoskeletal structures. This paper reviews cutting-edge research in brain-inspired visual cognition, decision-making, motion control, and musculoskeletal systems. Two software systems and a corresponding hardware system are established, aiming at the verification and application of next-generation brain-inspired musculoskeletal robots.
2023, vol. 20, no. 1, pp. 19-37,
doi: 10.1007/s11633-022-1343-2
Abstract:
In the past decades, artificial intelligence (AI) has achieved unprecedented success, and statistical models have become the central entities in AI. However, the centralized training and inference paradigm for building and using these models is facing more and more privacy and legal challenges. To bridge the gap between data privacy and the need for data fusion, federated learning (FL) has emerged as an AI paradigm for solving data silo and data privacy problems. Based on secure distributed AI, federated learning emphasizes data security throughout the lifecycle, which includes the following steps: data preprocessing, training, evaluation, and deployment. FL maintains data security by using methods such as secure multi-party computation (MPC), differential privacy, and hardware solutions to build and use distributed multi-party machine learning systems and statistical models over different data sources. Besides data privacy concerns, we argue that the concept of “model” also matters: when federated models are developed and deployed, they are easily exposed to various risks, including plagiarism, illegal copying, and misuse. To address these issues, we introduce FedIPR, a novel ownership verification scheme that embeds watermarks into FL models to verify ownership and protect model intellectual property rights (IPR, or IP-rights for short). While security is at the core of FL, many articles still refer to distributed machine learning with no security guarantee as “federated learning”, which does not satisfy the intended definition of FL. To this end, in this paper we reiterate the concept of federated learning and propose secure federated learning (SFL), whose ultimate goal is to build trustworthy and safe AI with strong privacy preservation and IP-right preservation. We provide a comprehensive overview of existing works, including the threats, attacks, and defenses in each phase of SFL from the lifecycle perspective.
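The abstract does not detail how FedIPR embeds watermarks, so the sketch below shows a common feature-based alternative: a regularizer drives the signs of a secret projection of the weights to encode the owner's bits, which can later be read back for ownership verification. All names and parameters are illustrative, not the FedIPR construction itself.

```python
import torch

def watermark_regularizer(weight, projection, bits):
    """Encourage sign(projection @ flattened_weight) to match the owner's secret bits,
    so the bits can later be read back from the trained parameters."""
    response = projection @ weight.flatten()               # (num_bits,)
    targets = bits.float() * 2.0 - 1.0                     # {0, 1} -> {-1, +1}
    return torch.nn.functional.relu(0.5 - response * targets).mean()  # hinge: push signs to match

def extract_bits(weight, projection):
    return (projection @ weight.flatten() > 0).long()

# Toy usage: embed 32 secret bits into one convolution weight during training.
torch.manual_seed(0)
conv = torch.nn.Conv2d(3, 8, 3)
proj = torch.randn(32, conv.weight.numel())                # owner's secret projection matrix
bits = torch.randint(0, 2, (32,))                          # owner's secret watermark bits
opt = torch.optim.SGD(conv.parameters(), lr=0.1)
for _ in range(200):
    opt.zero_grad()
    watermark_regularizer(conv.weight, proj, bits).backward()
    opt.step()
print((extract_bits(conv.weight, proj) == bits).float().mean())  # bit-match rate, ideally 1.0
```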
2023, vol. 20, no. 1,
pp. 38-56,
doi: 10.1007/s11633-022-1369-5
Abstract:
In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era. Substantial work has shown that pre-trained models are beneficial for downstream uni-modal tasks and avoid training a new model from scratch. Can such pre-trained models also be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances in five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize the specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey focused on VLP. We hope that this survey can shed light on future research in the VLP field.
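Among the pre-training objectives such a survey covers, symmetric image-text contrastive learning is a representative example. The following is a minimal, hedged sketch of an InfoNCE-style objective over a batch of paired embeddings; the tensor names and temperature value are illustrative and not tied to any specific VLP model.

```python
# Illustrative sketch of a symmetric image-text contrastive (InfoNCE-style)
# pre-training objective, a common ingredient of VLP. Names are assumptions.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Matched image/text pairs (same row index) are positives; every other
    row in the batch acts as a negative."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    idx = np.arange(len(logits))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[idx, idx].mean()       # diagonal entries are positives

    # average of image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
loss = contrastive_loss(rng.normal(size=(8, 512)), rng.normal(size=(8, 512)))
print("contrastive loss:", loss)
```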
2023, vol. 20, no. 1,
pp. 57-78,
doi: 10.1007/s11633-022-1361-0
Abstract:
In the past decade, multimodal neuroimaging and genomic techniques have advanced considerably. As an interdisciplinary topic, brain imaging genomics is devoted to evaluating and characterizing genetic variants in individuals that influence phenotypic measures derived from structural and functional brain imaging. This technique is capable of revealing the complex mechanisms that link the genetic level, via macroscopic imaging intermediates, to cognition and psychiatric disorders in humans. It is well known that machine learning is a powerful tool in data-driven association studies, which can fully utilize a priori knowledge (the intercorrelated structural information among imaging and genetic data) for association modelling. In addition, such association studies can uncover associations between risk genes and brain structure or function, so that a better mechanistic understanding of behaviors or disordered brain function can be obtained. In this paper, the related background and fundamental work in imaging genomics are first reviewed. Then, we show the univariate learning approaches for association analysis, summarize the main ideas and modelling in genetic-imaging association studies based on multivariate machine learning, and present methods for joint association analysis and outcome prediction. Finally, this paper discusses some prospects for future work.
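As a concrete illustration of the univariate association analysis mentioned above, one common form is a simple regression of one imaging phenotype on one genetic variant. The sketch below is didactic only, uses simulated data, and its variable names are assumptions rather than anything from the paper.

```python
# Minimal, illustrative univariate imaging-genetics association test:
# regress a single imaging phenotype (e.g., a regional brain volume)
# on a SNP genotype coded as a 0/1/2 dosage. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500
snp = rng.integers(0, 3, size=n).astype(float)          # genotype dosage 0/1/2
phenotype = 0.3 * snp + rng.normal(scale=1.0, size=n)   # simulated brain measure

slope, intercept, r, p_value, stderr = stats.linregress(snp, phenotype)
print(f"effect size (slope) = {slope:.3f}, p = {p_value:.2e}")
```

In practice such a test would be repeated over many variant-phenotype pairs with covariate adjustment and multiple-comparison correction, which is exactly the scale problem that motivates the multivariate methods the survey goes on to review.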
2023, vol. 20, no. 1,
pp. 79-91,
doi: 10.1007/s11633-022-1360-1
Abstract:
Concept learning constructs visual representations that are connected to linguistic semantics, which is fundamental to vision-language tasks. Although promising progress has been made, existing concept learners are still vulnerable to attribute perturbations and out-of-distribution compositions during inference. We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts, e.g., {red, blue, ···} ∈ the “color” subspace yet cube ∈ “shape”. In this paper, we propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces (i.e., visual superordinates). With only natural visual question answering data, our model first acquires the semantic hierarchy from a linguistic view and then explores mutually exclusive visual superordinates under the guidance of the linguistic hierarchy. In addition, quasi-center visual concept clustering and superordinate shortcut learning schemes are proposed to enhance the discrimination and independence of concepts within each visual superordinate. Experiments demonstrate the superiority of the proposed framework under diverse settings, increasing the overall answering accuracy by a relative 7.5% for reasoning with perturbations and 15.6% for compositional generalization tests.
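To convey the flavor of semantic-aware visual subspaces, the sketch below assigns a concept to a feature by projecting it into one superordinate's subspace and picking the nearest concept center. This is purely a hypothetical illustration of the idea; the projections, centers, and the clustering procedure of the actual framework are not specified in the abstract.

```python
# Hypothetical illustration of "visual superordinates": each superordinate
# (e.g., color, shape) owns its own projection of a visual feature, and a
# concept is recognized by the nearest center within that subspace.
# Everything here (projections, centers) is randomly initialized for demo.
import numpy as np

rng = np.random.default_rng(0)
feat_dim, sub_dim = 256, 32

superordinates = {
    "color": {"proj": rng.normal(size=(feat_dim, sub_dim)),
              "centers": {"red": rng.normal(size=sub_dim),
                          "blue": rng.normal(size=sub_dim)}},
    "shape": {"proj": rng.normal(size=(feat_dim, sub_dim)),
              "centers": {"cube": rng.normal(size=sub_dim),
                          "sphere": rng.normal(size=sub_dim)}},
}

def classify(feature, superordinate):
    """Project the feature into one superordinate subspace and return the
    nearest concept center (a quasi-center would be updated online)."""
    sub = superordinates[superordinate]
    z = feature @ sub["proj"]
    return min(sub["centers"], key=lambda c: np.linalg.norm(z - sub["centers"][c]))

x = rng.normal(size=feat_dim)      # stand-in for an object's visual feature
print(classify(x, "color"), classify(x, "shape"))
```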
2023, vol. 20, no. 1,
pp. 92-108,
doi: 10.1007/s11633-022-1365-9
Abstract:
This paper introduces deep gradient network (DGNet), a novel deep framework that exploits object gradient supervision for camouflaged object detection (COD). It decouples the task into two connected branches, i.e., a context encoder and a texture encoder. The essential connection is the gradient-induced transition, representing a soft grouping between context and texture features. Benefiting from the simple but efficient framework, DGNet outperforms existing state-of-the-art COD models by a large margin. Notably, our efficient version, DGNet-S, runs in real time (80 fps) and achieves results comparable to the cutting-edge model JCSOD-CVPR21 with only 6.82% of its parameters. The application results also show that the proposed DGNet performs well in polyp segmentation, defect detection, and transparent object segmentation tasks. The code will be made available at https://github.com/GewelsJI/DGNet.
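The abstract names object gradient supervision without detailing how the supervisory signal is built; the snippet below sketches one plausible reading, deriving a gradient-magnitude map restricted to the ground-truth object region as an auxiliary target for a texture branch. It is an assumption-laden illustration, not DGNet's actual pipeline (for which see the repository above).

```python
# Illustrative sketch of an "object gradient" supervision target: the
# gradient magnitude of the image, kept only inside the ground-truth mask.
# Details (filters, normalization) are assumptions for demonstration only.
import numpy as np

def object_gradient_map(image_gray, mask):
    """Finite-difference gradient magnitude, zeroed outside the object."""
    gy, gx = np.gradient(image_gray.astype(float))
    grad = np.hypot(gx, gy)
    return grad * (mask > 0)

rng = np.random.default_rng(0)
img = rng.random((64, 64))                          # stand-in grayscale image
mask = np.zeros((64, 64)); mask[16:48, 16:48] = 1   # stand-in object mask
target = object_gradient_map(img, mask)             # auxiliary supervision signal
print("supervised pixels:", int((target > 0).sum()))
```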
2023, vol. 20, no. 1,
pp. 109-120,
doi: 10.1007/s11633-022-1353-0
Abstract:
Structural neural network pruning aims to remove redundant channels in deep convolutional neural networks (CNNs) by pruning the filters of less importance to the final output accuracy. To reduce the degradation of performance after pruning, many methods utilize a loss with sparse regularization to produce structured sparsity. In this paper, we analyze these sparsity-training-based methods and find that the regularization of unpruned channels is unnecessary. Moreover, it restricts the network's capacity, which leads to under-fitting. To solve this problem, we propose a novel pruning method, named MaskSparsity, with pruning-aware sparse regularization. MaskSparsity imposes the fine-grained sparse regularization on the specific filters selected by a pruning mask, rather than all the filters of the model. Before the fine-grained sparse regularization of MaskSparsity, many methods can be used to obtain the pruning mask, such as running the global sparse regularization. MaskSparsity achieves a 63.03% floating point operations (FLOPs) reduction on ResNet-110 by removing 60.34% of the parameters, with no top-1 accuracy loss on CIFAR-10. On ILSVRC-2012, MaskSparsity reduces more than 51.07% of FLOPs on ResNet-50, with only a loss of 0.76% in the top-1 accuracy. The code of this paper is released at https://github.com/CASIA-IVA-Lab/MaskSparsity. We have also integrated the code into a self-developed PyTorch pruning toolkit, named EasyPruner, at https://gitee.com/casia_iva_engineer/easypruner.
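The following is a minimal sketch of pruning-aware sparse regularization in the spirit described above: an L1 penalty applied only to channels flagged by a pruning mask, leaving kept channels unregularized. Attaching the penalty to batch-norm scaling factors, and the hyper-parameters and mask shown, are assumptions for illustration; the released code linked above is the authoritative implementation.

```python
# Sketch of masked (pruning-aware) sparse regularization: L1 penalty on the
# batch-norm scaling factors of channels selected for pruning (mask == 1).
# The choice of BN gammas and the weight lam are illustrative assumptions.
import torch
import torch.nn as nn

def masked_sparsity_loss(bn_layers, prune_masks, lam=1e-4):
    """Sum |gamma_c| over channels flagged for pruning only."""
    penalty = 0.0
    for bn, mask in zip(bn_layers, prune_masks):
        penalty = penalty + (bn.weight.abs() * mask).sum()
    return lam * penalty

# Toy example: one BN layer with 8 channels, 3 of which are marked to prune.
bn = nn.BatchNorm2d(8)
mask = torch.tensor([1, 0, 0, 1, 0, 0, 0, 1], dtype=torch.float32)

task_loss = torch.tensor(0.0)                  # stand-in for the usual training loss
total_loss = task_loss + masked_sparsity_loss([bn], [mask])
total_loss.backward()                          # gradients reach only the masked gammas
print(bn.weight.grad)                          # non-zero exactly at masked channels
```

Because unmasked channels receive zero penalty gradient, their capacity is untouched, which is the point the paper makes against regularizing all channels.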
2023, vol. 20, no. 1,
pp. 121-144,
doi: 10.1007/s11633-022-1367-7
Abstract:
Swarm intelligence has become a hot research field of artificial intelligence. Considering the importance of swarm intelligence for the future development of artificial intelligence, we discuss and analyze swarm intelligence from a broader and deeper perspective. In a broader sense, we consider not only bio-inspired swarm intelligence but also human-machine hybrid swarm intelligence. In a deeper sense, we organize the research into a three-layer hierarchy: in the first layer, we divide the research of swarm intelligence into bio-inspired swarm intelligence and human-machine hybrid swarm intelligence; in the second layer, bio-inspired swarm intelligence is divided into single-population swarm intelligence and multi-population swarm intelligence; and in the third layer, we review single-population, multi-population, and human-machine hybrid models from different perspectives. Single-population swarm intelligence is inspired by biological intelligence. To further solve complex optimization problems, researchers have made preliminary explorations in multi-population swarm intelligence. However, it is difficult for bio-inspired swarm intelligence to realize the dynamic cognitive intelligent behavior that meets the needs of human cognition. Researchers have therefore introduced human intelligence into computing systems and proposed human-machine hybrid swarm intelligence. In addition to single-population swarm intelligence, we thoroughly review multi-population and human-machine hybrid swarm intelligence in this paper. We also discuss the applications of swarm intelligence in optimization, big data analysis, unmanned systems, and other fields. Finally, we discuss future research directions and key issues to be studied in swarm intelligence.
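As a concrete example of the single-population, bio-inspired class discussed above, a minimal particle swarm optimization (PSO) loop is sketched below. The coefficients follow common textbook defaults and the test function is the sphere function; none of this is taken from the survey itself.

```python
# Minimal particle swarm optimization (PSO): a canonical single-population,
# bio-inspired swarm intelligence algorithm. Parameters are textbook defaults.
import numpy as np

def pso(objective, dim=2, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5):
    rng = np.random.default_rng(0)
    pos = rng.uniform(-5, 5, size=(n_particles, dim))   # particle positions
    vel = np.zeros_like(pos)                             # particle velocities
    pbest = pos.copy()                                   # personal bests
    pbest_val = np.array([objective(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()             # global best
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

best_x, best_f = pso(lambda x: np.sum(x ** 2))   # minimize the sphere function
print("best solution:", best_x, "objective:", best_f)
```

Multi-population variants extend this loop with several interacting sub-swarms, and human-machine hybrid approaches inject human feedback into the update, which is where the survey's deeper hierarchy comes in.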