MoNetViT: an efficient fusion of NCS and transformer technologies for visual navigation assistance with multi query attention

Abstract

Aruco markers are crucial for navigation in complex indoor environments, especially for those with visual impairments. Traditional NCSs handle image segmentation well, but transformers excel at capturing long-range dependencies, essential for machine vision tasks. Our study introduces MoNetViT (Mini-MobileNet MobileViT), a lightweight model combining NCSs and MobileViT in a dual-path encoder to optimize global and spatial image details. This design reduces complexity and boosts segmentation performance. The addition of a multi-query attention (MQA) module enhances multi-scale feature integration, allowing end-to-end learning guided by ground truth. Experiments show MoNetViT outperforms other semantic segmentation algorithms in efficiency and effectiveness, particularly in detecting Aruco markers, making it a promising tool to improve navigation aids for the visually impaired.

1 Introduction

Navigating independently is a major challenge for individuals with visual impairments. It affects their ability to perform daily tasks and limits their involvement in social and economic activities. This challenge reduces their personal autonomy and impacts their overall quality of life. Indoor navigation issues for visually impaired individuals persist, as current solutions frequently do not overcome critical limits. Conventional GPS is inadequate inside due to signal interference, requiring alternative systems such as beacon-based technologies and smartphone applications that employ digital maps (Theodorou et al., 2022; Kubota, 2024). Nonetheless, these programs continuously rely on pre-existing maps, which aren’t universally accessible, therefore constraining their efficacy (Kubota, 2024).

The intricacy of inside environments exacerbates the issue, as visually impaired people encounter challenges in traversing unfamiliar settings on account of ambiguous aural or tactile indicators and numerous impediments (Jeamwatthanachai et al., 2019; Fernando et al., 2023). Although programs equivalent to Snap&Nav present navigation options via the creation of node maps, their dependence on sighted support diminishes person autonomy (Kubota, 2024). Moreover, the computing necessities of up to date fashions present difficulties. Advanced algorithms and machine studying strategies improve impediment identification and route planning however continuously necessitate substantial processing assets, rendering them impractical for cellular purposes (Tao and Ganz, 2020; Shah et al., 2023). The integration of IoT and cloud computing introduces further complexity, emphasizing the need for light-weight, reliable programs designed for visually impaired customers (Messaoudi et al., 2020). Rectifying these deficiencies is crucial for the development of inclusive indoor navigation options.

Recent analysis has more and more targeted on creating superior navigation aids to assist visually impaired people in each indoor and out of doors environments. Technologies equivalent to deep studying, machine imaginative and prescient, wearable gadgets, and cellular purposes have been leveraged to reinforce navigation capabilities, providing promising options to this pervasive challenge (Bai et al., 2019; El-taher et al., 2021; Kuriakose et al., 2021; Martínez-Cruz et al., 2021).

The significance of creating an efficient navigation assistance system for visually impaired people, significantly in obstacle-filled indoor environments, can’t be overstated. These environments current distinctive challenges that require subtle options succesful of offering correct and real-time steering. The integration of superior AI fashions, equivalent to MobileNetV2 and MobileViTV2, alongside with multi-query attention mechanisms, has proven potential in creating sturdy and efficient navigation programs. These fashions goal to empower visually impaired people, permitting them to navigate unfamiliar areas with confidence and independence (Wang et al., 2019).

The important analysis downside addressed on this examine is the event of an unbiased navigation mannequin that improves the accuracy and effectivity of detecting and deciphering navigation markers below excessive circumstances for individuals with visual impairment. Traditional navigation programs typically fall brief in complicated, obstacle-filled indoor environments, necessitating the necessity for a extra superior answer. This analysis proposes integrating MobileNetV2 and MobileViTV2 strategies, enhanced by multi-query attention mechanisms, to develop a navigation mannequin that gives exact and dependable assistance, thereby enhancing the standard of life for visually impaired people.

The integration of MobileNetV2 and MobileViTV2 strategies represents a cutting-edge method to creating an unbiased navigation mannequin for the visually impaired. MobileNetV2, launched by Sandler et al. (2018), is designed to function effectively on cellular and embedded gadgets, making it extremely appropriate for real-time purposes. Its structure employs inverted residuals and linear bottlenecks, which assist preserve excessive accuracy whereas decreasing computational calls for. This effectivity is essential for purposes requiring portability and speedy response, equivalent to navigation aids for visually impaired people.

On the opposite hand, MobileViTV2, as explored by Chen et al. (2021), makes use of imaginative and prescient transformers to reinforce the mannequin’s functionality to know visual contexts. Vision transformers are adept at capturing long-range dependencies inside pictures, offering a extra complete interpretation of complicated scenes. The integration of these technologies, mixed with multi-query attention mechanisms as highlighted by Mehta and Apple (2022), permits the mannequin to give attention to a number of points of the visual enter concurrently. This multifaceted attention mechanism is instrumental in enhancing the accuracy and timeliness of navigation directions, thus offering a major development over present fashions.

Existing analysis on navigation aids for visually impaired people has explored a spread of technological options, starting from GPS-based purposes to wearable gadgets and laptop imaginative and prescient strategies. Studies like these by Martínez-Cruz et al. (2021) and El-taher et al. (2021) have highlighted the restrictions of GPS in indoor environments and the bulkiness of wearable gadgets, respectively. These limitations underscore the necessity for extra refined and user-friendly options. The integration of deep studying fashions, equivalent to these utilizing NCSs, has proven promise; nevertheless, these fashions typically require substantial computational assets, limiting their practicality in cellular settings (Bai et al., 2019).

The latest improvement of efficient fashions like MobileNetV2 and imaginative and prescient transformers like MobileViTV2 addresses some of these challenges by providing excessive accuracy with diminished computational calls for. However, a niche stays within the efficient integration of these technologies to develop a complete navigation system that’s each light-weight and succesful of real-time processing. Additionally, the potential advantages of multi-query attention mechanisms in enhancing the main focus and accuracy of these fashions haven’t been totally explored. This hole presents an alternative to develop a novel, built-in answer that leverages these superior strategies for improved navigation assistance.

The goal of this analysis is to develop a brand new navigation mannequin for visually impaired people by combining MobileNetV2 and MobileViTV2 with multi-query attention mechanisms. The speculation is that this AI mannequin will enhance each the accuracy and effectivity of Aruco marker detection below difficult circumstances in comparison with present fashions. This development is anticipated to supply dependable navigation assistance in indoor environments, enhancing the standard of life for visually impaired customers. The examine focuses on designing, implementing, and evaluating the mannequin, with future work aimed toward refining fusion mechanisms, decreasing mannequin complexity, and exploring switch studying to keep up excessive accuracy whereas minimizing computational calls for.

Following a short introduction of the issue assertion and the proposed technique, the remainder of the paper is structured as follows: Section 2 outlines the analysis methodology used to conduct the examine. In Section 3, we current the search outcomes obtained from the analysis. Section 4 discusses the findings associated to multi-scale options and the varied combos of MQA and FFM used to enhance mannequin segmentation of multi-class ArUco markers. Finally, the conclusion of the paper is supplied in Section 5.

2 Methods

2.1 Transformer and NCSs

NCSs have demonstrated exceptional performance in various image segmentation tasks, showcasing their robust feature representation capabilities. However, despite these strengths, NCS-based methods frequently encounter limitations in modeling long-range relationships. A primary issue is their inefficiency in capturing global context information. Methods that rely on stacking receptive fields necessitate continuous downsampling convolution operations, leading to deeper networks. Training such deep neural networks on small datasets can present significant challenges, including training instability and overfitting. Overfitting is particularly common in deep learning models due to their strong expressive ability relative to traditional models (Zhang et al., 2023). Non-local attention mechanisms have been more and more utilized in numerous fields to handle challenges associated to capturing long-range dependencies and international info (Mei et al., 2020; Huang et al., 2022; Abozeid et al., 2023; Zhou et al., 2023). While these mechanisms can improve the community’s potential to seize international context, additionally they introduce appreciable computational complexity. This complexity, which is quadratic in relation to the enter measurement, typically renders these strategies impractical for high-resolution pictures.

Attention mechanisms were utilized in numerous research that focused on integrating Convolutional Neural Networks. Especially to further enhance the output processing of NCSs. Various visual tasks were implemented with integrated approaches, including video processing (Qi and Zhang, 2023; Sun et al., 2022; Mujtaba et al., 2022), picture classification (Dosovitskiy et al., 2021; Liu et al., 2021), and object detection (Benmouna et al., 2023; Wen et al., 2023).

The transformer in natural language processing used transformation tasks (Vaswani et al., 2023). Several pure language processing actions have since shifted to utilizing it. Some pure language processing actions have switched to utilizing ViT. Pre-training on very massive datasets is required for ViT (Chen et al., 2023; Misawa et al., 2024). To the State of the Art within the pure picture segmentation activity, Imagenet changed the encoder part of the decoding community with a transformer (Doppalapudi, 2023; Xia and Kim, 2023).

Although transformer-based models have demonstrated impressive skills in diverse visual tasks, they have not yet attained acceptable results when compared to traditional NCSs. Transformer designs still exhibit worse performance in visual tasks compared to similarly-sized NCSs, such as EfficientNet (Thakur et al., 2023). The computational price of transformers primarily based on the mechanism of self-attention is , in distinction to the convolution-based NCSs (Zhou et al., 2024). Therefore, using the transformer for image-related actions will unavoidably want a considerable quantity of GPU assets.

2.2 Image segmentation using transformer and NCSs

The present cutting-edge architecture in computer vision predominantly depends on complete NCSs, with UNet (Chen et al., 2021) and its variations being notable cases. The present state-of-the-art (SOTA) framework in laptop imaginative and prescient primarily depends on full NCSs, with UNet and its variants being distinguished examples. UNet (Chen et al., 2021) employs an encoding-decoding community structure. This structure makes use of cascaded convolutional layers to extract numerous ranges of visual traits. The decoder makes use of skip connections to recycle high-resolution characteristic maps generated by the encoder, enabling the retrieval of essential characteristic info (Petit et al., 2021).

2.3 Lightweight networks

Deep learning, although powerful, often requires extensive training data to effectively enhance model learning. However, challenges arise in scenarios like the ArUco dataset due to limitations in data collection related to factors such as lighting conditions, capture angles, and distances. Moreover, the availability of large, publicly accessible datasets is limited, further complicating model training (Lee et al., 2019). To deal with these challenges, the event of light-weight deep studying fashions turns into crucial.

Research in deep learning has demonstrated that supervised training of deep learning models heavily relies on large labeled datasets (Karimi et al., 2020). This requirement poses a major problem, particularly in situations the place knowledge assortment is constrained. Techniques equivalent to mannequin optimization, pruning, quantization, and information distillation have been explored to create light-weight deep-learning fashions appropriate for cellular terminals (Wang et al., 2022). These approaches goal to scale back the computational burden whereas sustaining mannequin efficiency.

A self-attention-based vision transformer (ViT), known as MobileViT, is employed to learn the global representation of images. MobileViT (Mehta and Apple, 2022) stands out because the preliminary light-weight, general-purpose transformer designed for cellular gadgets. An method integrating a transformer with a NCS-based light-weight mannequin was investigated, with a selected give attention to assessing the feasibility of this light-weight community mannequin for the difficult activity of ArUco marker segmentation.

2.4 Mini-MobileNet-MobileViT network

In this part, the Mini-MobileNet-MobileViT (MoNetViT) network architecture and its principal network components were introduced. The system’s backbone structure follows to the architecture of an encoder and decoder, as represented in Figure 1A. Section 2.4.1 provides a extra detailed clarification sub-network of the encoder, whereas Section 2.4.2 focuses on the exploration sub-network of the decoder. The MobileViT module, an important half of the community’s encoder structure, is launched in Section 2.4.3. This part covers the structure of the MobileViT module, its main calculation course of internally, and the comparisons between this module and NCS. Furthermore, the MQA module that’s urged on this examine is described in Section 2.4.4. The Globalized Block and the Asymmetrical Globalized Block, in addition to the justification for their adoption, are half of this module.

2.4.1 Encoder sub-network

The proposed model will be developed using an encoder-decoder structure, where the encoder will build two parallel paths connected by a series of attention additions, improving the model’s ability to capture spatial and channel dependencies. The encoder will use MobileNet v2 (MN2 block) (Sandler et al., 2018) and MobileViT block as the bottom module. , is the illustration of the enter picture, the place H and 𝑊 stand for the enter picture’s peak and width, respectively. The enter picture undergoes decision degradation via three consecutive levels. In every stage, the dimensions of the characteristic map is diminished by an element of 2. As a outcome, the output characteristic maps are shrunk to one-half, one-fourth, and one-eighth of the preliminary characteristic map. The MobileViT block is one of the important parts used within the encoder. The enter and output sizes of the MobileViT block are the identical, indicating that this module doesn’t change the spatial dimensions of the characteristic map. The MN2 block is one other primary module used within the encoder. Stride 1 implies that the module doesn’t carry out decision degradation, and the enter and output sizes stay the identical.

At the th stage in Equation 1, it’s assumed that (·) represents the transformation perform of the th MV2-Block. For instance, denotes the outcome produced by the 4th MN2-Block within the th stage. The MobileViT-block module on the th stage has a change perform denoted as (·). It is essential to emphasise that there’s a singular MobileViT-block current at every stage.

Moreover, if we represent the output generated by the th MN2-Block module during the th stage as (), it is important to highlight that just the first two MN2-Block components reduce the resolution of the original feature map. Thus, belongs to the set of elements in , where belong to the collection and is an element of the set . represents the numerical value assigned to the feature channel at the th stage.

The channel attention module is denoted by . It is reasonable to assume that at this stage , the output of the left path is and the output of the right path is . The formulas Equation 2 can be utilized to compute and :

Here, represents a convolution operation, represents the feature map, and and represent the outputs of the third and fourth MN2-Block modules at stage , respectively. The function concatenates the feature map with the output of the channel attention module applied to the sum of and . The function splits the resulting tensor into multiple parts.

2.4.2 Decoder sub-network

In the context of a decoder sub-network shown in Figure 1B, the perform represents the operation of the Feature Fusion Module (FFM) like figured in Equation 3. The module takes enter and processes it via a sequence of transformations involving convolutional operations and batch normalization. The system supplied is as follows:

represents a divided convolution purposeful with a kernel measurement of 3 × 3 and an enhance price of 1, which is the same as a standard convolution. represents a dilated convolution operation with a kernel measurement of 3 × 3 and an enlargement price of 2. This signifies a convolution with a dilation course of utilizing a 3 × 3 kernel space and an enlargement price that’s 1, equal to a standard. BatchNorm refers back to the batch normalization operation that standardizes the inputs to a layer for every mini-batch.

In the decoder stage of the network, the feature maps are represented as , where . After the encoder phase, the operation on the feature map (at the third stage of the decoder) is defined . The two main steps for calculating as described. This process entails increasing the resolution of the data and then combining it with the results from the earlier stage of the encoding process. The initial step involves performing up-sampling and Feature Fusion Mapping (FFM). Up-sampling is intended to adjust the feature size to match the output size of the encoder from the previous stage, as illustrated in Equation 4. This course of yields an intermediate variable, denoted as . The subsequent step entails a characteristic fusion operation with the encoder’s output from the prior stage, as demonstrated in Equation 5:

Upsample(·, t) denotes the procedure of augmenting the data map in accordance with the parameter t by bilinear interpolation. The PReLU function of activation is denoted as PReLU(·), whereas batch normalization is indicated as BatchNorm(·). Once all are calculated, a final prediction is obtained through a convolution.

Where is an element of the set Softmax(·) denotes the function that activates softmax. denotes the predicted class label map, with being the final output prediction, as demonstrated in Equation 6.

2.4.3 MobileViT block

Vision Transformers (ViTs) can achieve comparable accuracy to Convolutional Neural Networks (NCSs) in image identification tasks, especially when trained on extensive datasets (Dosovitskiy, 2021). On the opposite hand, in contrast to NCSs, ViTs are tough to optimize and require a big quantity of knowledge for coaching. Research signifies that the suboptimal efficiency of ViTs is because of an absence of inductive biase (Lee et al., 2019; Petit et al., 2021; Zhou et al., 2024). Inductive biases, whereas helpful, even have drawbacks for NCSs; they allow NCSs to seize native spatial info however can restrict the community’s total efficiency.

However, the transformer’s self-attention system has the capacity to collect global data. Numerous transformers and NCSs combinations have been investigated to overcome their respective deficiencies. ConViT (d’Ascoli et al., 2022) makes use of gated positional self-attention mushy convolutional inductive biases. Semantic segmentation fashions equivalent to ACNET (Hu et al., 2019) and CMANet (Zhu et al., 2022) have been developed; nevertheless, many of these fashions are computationally intensive. The risk of leveraging the strengths of each NCSs and ViTs to assemble a light-weight community for visual duties stays an space of ongoing exploration. MobileViT means that such an method is certainly possible. In this paper, we first study the calculations concerned in MobileViT.

The MobileViT Block, seen in

Figure 1D

, shares an equivalent construction with the MobileViT Unit (

Mehta and Apple, 2022

). The following 4 phases are utilized to a given supply tensor

The input tensor 𝑋 is first passed through an standard convolution layer, followed by a convolution layer to generate . The convolution stage captures and represents nearby spatial details, whereas the convolution transforms the tensors into higher-dimensional spaces (with dimensions, where is more than ) by acquiring knowledge of a linear combination of the input channels.

In order to incorporate spatial inductive bias into MobileViT’s learning process, the input is divided into non-overlapping flattening patches , where . The total number of patches is represented by the formula , where h and w are the physical dimensions of every single patch, individually.

The transformer is then applied to encode the relationships between the patches through the following operation, as demonstrated in Equation 7

The resulting is then folded back to obtain .

Ultimately, is transformed into a space with fewer dimensions ( dimensions) using point-by-point convolution and then merging with using concatenation.

The second phase contains the algorithm’s core. The input image , which has dimensions , is separated into patches in the standard ViT structure. Subsequently, every patch undergoes a linear transformation to convert it into a vector. These vectors are then encoded with positional information. Furthermore, the interconnections between the patches are acquired by employing transformer blocks.

Contrary to ViT, the MobileViT algorithm preserves both the patch order and the physical order of pixels inside each of the patches during its second stage. It is crucial to emphasize that the values of and must be exact divisors of and , respectively.

Local information can be encoded by the relationship . The yellow pixels inside a patch have the ability to aggregate data from the pixels that surround them in that patch, as seen in Figure 1C. accomplishes the worldwide knowledge encoding of the transformer by encoding inter-patch connections on the -th place of each patch. The crimson pixel retains observe of each one of the pixels that encode the complete picture since, as Figure 1C illustrates, it could actually determine the yellow pixel that’s on the identical location in different patches. The light-weight facet of the mannequin is enhanced by the dot product process, which selects solely pixels which can be in the identical place.

According to Mehta and Apple (2022), peculiar convolution might be damaged down into three steps: unfolding, matrix multiplication, and folding. Based on the beforehand indicated computation, a layer of convolution and the Unfold operation perform the native characteristic modeling, offering them with convolution-like inductive biases. Next, international characteristic modeling is carried out utilizing the Transformer → Fold sequence, which provides the MobileViT block international processing energy.

2.5 Multi query attention

Figure 2A

shows the configuration of an ordinary non-local block (

Wang et al., 2018

). The non-local block (

Wang et al., 2018

) first requires the computation of the similarity between all locations. This is completed by performing matrix multiplication on an enter

. The main computational process within the non-local block might be succinctly described as consisting of the next 5 steps:

The source feature is subjected to three convolutions, denoted as , , and , resulting in the transformation of into , and . The three numbers correspond to the query, key, and value, respectively. They are used to change the total amount of streams from 𝐶 to .

A similarity matrix M is created by flattening the query, key, and value to size , where . The matrix is calculated to determine the similarity, as demonstrated in Equation 8:

The matrix is normalized using a normalization function such as softmax:

The matrix of attention is then derived, as demonstrated in Equation 9:

Reassessing the Non-local Asymmetric Block

The computational complexity of the global attention block can be described as . The calculation efficiency is mostly affected by N’s size. To fix this, reduce to without altering output size. Zhu et al. (2019) provides the uneven non-local block, whose development is proven in Figure 2B, to sort out this downside.

The asymmetrical pyramid non-local block (APNB) is a modified version of this block that incorporates pyramid pooling within the non-local block in order to decrease computational expenses. One more thing is added after and : a spatial pyramid pooling function (Lazebnik et al., 2006) to pick just a few good anchor factors. When the spatial pyramid pooling modules are and , with n denoting the pooling layer’s output measurement (width or peak) and as per Zhu et al. (2019), the general quantity of sampling anchor factors is . If we assume that and , the quantity of calculations shall be diminished by an quantity . This adjustment effectively decreases the worth of to a decrease worth, , by selectively sampling just a few pattern knowledge from and , as an alternative of using all of the factors.

The primary computational procedure in the asymmetrical non-local block entails the subsequent modifications, as demonstrated in Equations 10, 11:

The computational technique for the APNB module may be delineated as follows:

Introduce sampling modules and after and γ, accordingly, to pattern a number of sparse anchor factors. These anchor factors are designated as and , respectively.

Generating a similarity matrix , as demonstrated in Equation 12:

is normalized.

stands for normalization perform.

The attention matrix AP is subsequently computed, as demonstrated in Equation 13:

The final result is obtained by adding the product of the variables and to the variable X, and assigning it to the variable is a one-by-one convolution. represents a convolutions.

2.6 Motivation

The APNB discussed earlier operates with a single data stream as input, while the asymmetrical fusion nonlocal block typically utilizes two data sources: the top-level feature map and the lower-level feature map. In contrast, the proposed MQA mechanism extends this approach by incorporating four input sources. As shown in Figure 1A, MQA integrates the elemental worth (), the elevated traits and , and the sting content material (), permitting for the specific acquisition of a number of ranges of characteristic illustration. By incorporating edge info all through the semantic segmentation course of, the mannequin imposes useful constraints, enhancing segmentation precision. The cross-entropy loss perform additional refines this course of by measuring the distinction between the bottom reality () and the characteristic aggregation map ), guaranteeing sturdy alignment of predictions with the precise knowledge.

Commence the acquisition of a parameter map of features and additional feature maps , , and . Five 1 × 1 convolutions, denoted as and , are applied to transform these input maps into new feature maps: and , as demonstrated in Equation 14:

The parameters , , , and in this experiment correspond to the results , , , and P3 from MoNetViT. The sample as well as main computation methodologies within the MQA module were equivalent to those in APNB. The selection units , and are utilized to sample multiple sparse anchor points. These anchor points are represented as and , where and denote the number of sampled anchor points. is less than , which is less than , and is much less than . Mathematically, this is computed using the following Equation 15:

The correlation matrix for and anchoring is displayed here, as demonstrated in Equation 16:

The dimensions of the are , where is significantly smaller than . Next, the process of normalization is carried out on , which enables the calculation of .

The ultimate result of the initial layer is, as demonstrated in Equation 17:

The value of is equal to the product of and . belongs to the set of . The similarity matrix for levels 2, 3, and 4 is computed using an analogy, as demonstrated in Equation 18:

The equation is equal to the product of and . The equation is equal to the product of and . The equation is equal to the product of and .

The ultimate result of levels 2, 3, and 4 is determined in the following manner Equation 19:

The value of is equal to the product of and , where belongs to the set . The value of is equal to the product of and , where belongs to the set . The value of is equal to the product of and , where belongs to the set .

The final result is represented as belonging to the set of . The temporal complexity can be represented as , which is significantly lower than in the conventional non-local block.

2.7 Loss function

The function that measures loss is defined like the ones used by Yeung et al. (2022). The loss perform contains two parts: and . Equations 20, 21 (Yeung et al., 2022) show the loss perform.

where ground-truth (GT) and the anticipated edge map Pe’s coordinates for each pixel point are represented by ().

IoU loss and a conventional cross-entropy loss make up the two components of the loss function.

2.8 Experiment setup

The NVIDIA Tesla T4 GPU was used to train the model within the PyTorch framework for this project. The model underwent training for 50 epochs using a batch size of 16, with the Adam optimizer and a starting learning rate of 1e-3. A learning rate reduction factor of 0.5 was applied when no improvement was observed for 5 consecutive epochs (patience set to 5). The specific hyperparameters utilized in this investigation are outlined in Table 1. In addition, the experiment aimed to match the proposed MoNetViT mannequin with a number of state-of-the-art strategies, together with TransFuse (Zhang et al., 2021), Inf-Net (Fan et al., 2020), U-Net (Ronneberger et al., 2015), U-Net++ (Kwak and Sung, 2021), Mini-Seg (Kim et al., 2023), and DeepLabV3+ (Asadi Shamsabadi et al., 2022), for multi-class segmentation in ArUco marker identification.

Hyperparameter	Options
Resize the images	224×224
Epochs	50
Batch size	16
Optimizer	Adam
Learning rate (Lr)	1e – 3
Factor	0.5
Patience	5
,	0.2, 0.8

The researchers conducted experiments using freely available datasets for ArUco manual labeling. The dataset employed in this study is an open-source resource that has been fully labeled to indicate various classes of ArUco markers, making it ideal for training models in identification and classification tasks. The dataset preparation involved several pre-processing steps to enhance model generalizability and ensure reproducibility. Images were resized to a resolution of 224 × 224, and pixel values were normalized to the range 0 and 1. Additionally, data augmentation techniques such as random rotation, flipping, and contrast adjustment were applied to improve the model’s ability to generalize across varied conditions. The complete dataset, including all necessary images and labels for training and testing models in ArUco marker identification and classification tasks, can be accessed and downloaded from the following link: https://universe.roboflow.com/loliktry/dataarucomustofa/dataset/5. It was particularly utilized within the multi-class segmentation experiments. Table 2 supplies a complete overview of the precise particulars of this dataset.

Marker class	No. training	No. valid.	No. testing	Total images
1	853	188	289	1,330
2	904	238	269	1,411
3	895	237	271	1,403
Total	2,652	663	829	4,144

Specifications of datasets.

Furthermore, the present study utilized identical methodologies as Fan et al. (2020) to evaluate the efficiency of the mannequin. Standard standards, equivalent to accuracy, specificity, sensitivity, and Dice similarity coefficient, comprise the evaluation metrics. In addition, it makes use of a number of metrics from object recognition analysis strategies, such because the design measure, the improved alignment worth (Fan et al., 2018), and the imply absolute error.

3 Results

The dataset utilized in this study is notably large-scale, comprising a total of 4,144 slices. Consequently, the following sections will focus exclusively on a detailed analysis of the results, offering an in-depth examination of the findings and their implications.

3.1 Three-class ArUco marker labeling results

The segmentation data findings for ArUco Marker on the dataset are displayed in Figure 3, demonstrating that MoNetViT on this examine displays superior efficiency in comparison with different baseline fashions. U-Net and U-Net++ have low Dice scores and sensitivities, leading to massive unsegmented areas. Inf-Net and Mini-Seg present slight enhancements however nonetheless lack correct boundary detection. While TransFuse, a NCS + Transformer mannequin, was evaluated, it isn’t included in Figure 3 on account of its very low Dice rating. DeepLabV3+ performs fairly nicely however falls brief of MoNetViT. Overall, MoNetViT achieves the very best Dice, sensitivity, and specificity scores, alongside with the bottom MAE, indicating its superior segmentation precision.

The MoNetViT model shows superior performance over other state-of-the-art models, including U-Net, U-Net++, Mini-Seg, Inf-Net, TransFuse, and DeepLabV3+, across key evaluation metrics. As summarized in the Table 3, MoNetViT achieves the very best Dice rating (0.9584), sensitivity (0.9424), specificity (0.9424), structural similarity 0.9923), and imply edge accuracy (0.9381), whereas additionally attaining the bottom Mean Absolute Error (MAE) of 0.0077.

Methods	Param.(M)	Size(Mb)	Dice	Sen.	Spec.			MAE
U-Net	1.953	7.438	0.5622	0.6469	0.9803	0.9676	0.5622	0.0344
U-Net++	7.783	29.69	0.6082	0.5744	0.9013	0.9816	0.6082	0.0351
Inf-Net	0.076	0.291	0.5889	0.5585	0.8865	0.9762	0.5889	0.0458
Mini-Seg	0.038	0.145	0.6238	0.6268	0.9537	0.9845	0.6238	0.0292
TransFuse	0.019	0.073	0.3231	0.3333	0.6667	0.9403	0.3231	0.1177
DeepLabV3+	13.324	50.508	0.6351	0.6392	0.9655	0.9881	0.6351	0.0221
MoNetVIT(ours)	1.014	3.869	0.9584	0.9424	0.9424	0.9923	0.9381	0.0077

Result of three-class labeling.

Bold value indicates the best value.

This outstanding performance is attributed to the integration of Convolutional Neural Networks (NCSs) with a transformer module, which captures both local and global semantic features. The transformer component is crucial for calculating global semantic relationships, while the NCS module extracts local contextual features, resulting in a more robust feature representation. Additionally, the multi-query attention (MQA) module enriches feature diversity through supervised learning, enhancing overall model performance.

An analysis of false positives (FP) and false negatives (FN) further emphasizes MoNetViT’s robustness. With high sensitivity (0.9424) and specificity (0.9424), MoNetViT significantly reduces both FP and FN compared to other models. For instance, while DeepLabV3+ achieves a Dice score of 0.6351, its sensitivity (0.6392) and specificity (0.9655) indicate a higher FN rate relative to MoNetViT. Similarly, Mini-Seg’s balanced sensitivity (0.6268) and specificity (0.9537) suggest that it is more prone to FP and FN, impacting its reliability. In contrast, MoNetViT’s ability to minimize FP and FN contributes to its high Dice score and overall segmentation accuracy.

Despite Inf-Net having the smallest model size (0.073 MB) and TransFuse having the smallest parameter count (0.038 M), MoNetViT outperforms these models in critical metrics. Its effectiveness stems from model design rather than sheer training data volume, highlighting MoNetViT’s robustness and capacity to generalize effectively on the dataset. Paired T-tests indicated that MoNetViT’s enhancements had been statistically important (p

3.2 Component impact analysis

Several experiments were conducted to validate the functionality of the Multi Query Attention (MQA) and Fusion Feature Module (FFM), two crucial elements of the MoNetViT. Figure 1A depicts an structure consisting of three levels. The segmentation efficiency of the MoNetViT mannequin is significantly improved by the MQA module part, because the findings displayed in Table 4 reveal.

Methods	Loss	Dice	Sen.	Spec.			MAE
Backbone	0.0183	0.9584	0.9424	0.9424	0.9923	0.9381	0.0077
Backbone+MQA	0.0192	0.9558	0.9360	0.9360	0.9919	0.9342	0.0081
Backbone+FFM	0.0189	0.9566	0.9322	0.9322	0.9921	0.9354	0.0079
Backbone+MQA + FFM	0.0115	0.9731	0.9686	0.9686	0.9951	0.9599	0.0049

Bold value indicates the best value.

Employing MQA and FFM in conjunction with the baseline improves segmentation performance. Specifically, integrating both MQA and FFM with the baseline resulted in improvements of 1.5 and 1.7% in the Dice coefficient, respectively. The findings indicate that using MQA and FFM enhances the encoder’s and decoder’s clarity, thereby further improving segmentation performance. The analysis shows that combining Backbone, FFM, and MQA results in the best performance across all metrics. This model has the lowest loss (0.0115) and the highest Dice coefficient (0.9731), indicating more accurate segmentation and fewer errors. It also achieves the best sensitivity (0.9686) and specificity (0.9686), accurately identifying both positive and negative cases. Additionally, it has the highest structural alignment ( = 0.9951) and precision ( = 0.9599), along with the lowest Mean Absolute Error (0.0049). In comparison, other methods like Backbone alone, Backbone+FFM, and Backbone+MQA perform worse in various metrics, highlighting the benefits of using both FFM and MQA together.

3.3 Parameter comparison

Figure 4 supplies an in depth overview of the mannequin efficiency. MoNetViT achieves superior accuracy whereas sustaining a comparatively small parameter depend. Specifically, it makes use of about half the parameters of U-Net (1.014 M vs. 1.953 M) and considerably fewer parameters than U-Net++ (7.783 M) and DeepLabV3+ (13.324 M). While Inf-Net (0.076 M) and Mini-Seg (0.038 M) have smaller parameter counts, and TransFuse makes use of the fewest (0.019 M), MoNetViT achieves a a lot increased Dice rating (0.9584) in comparison with these fashions. This demonstrates that MoNetViT not solely optimizes mannequin measurement and complexity but in addition outperforms different fashions, together with transformer-based fashions like TransFuse, in phrases of accuracy (Figure 5).

4 Discussion

4.1 Comparison of multi-scale features

Numerous models for combining multi-scale features use conventional networks for object identification and semantic segmentation in image processing. For example, architectures like Feature Pyramid Network (FPN) and U-Net rely on three primary network paths: bottom-up, top-down, and horizontal connections. These paths allow the integration of high-level semantic information with low-level geometric data. In FPN, the bottom-up path extracts high-level features, while the top-down path applies upsampling to enhance semantic details at higher resolutions. Horizontal connections fuse low-level convolution features with high-level features, resulting in a more detailed representation of semantic information.

Nevertheless, FPN faces challenges due to its complex hierarchical structure. The computation of intermediary layers relies heavily on the higher-level layers, requiring the analysis of preceding layers to be completed before passing information to subsequent layers. This dependency can lead to inefficiencies in computation and integration. The proposed MQA approach addresses these limitations by enabling the simultaneous integration of lower-level map attributes, higher-level attribute maps, and edge attribute maps, streamlining the process and enhancing feature representation.

The current work offers a direct and efficient MQA approach and introduces a cascading multi input computational framework. The system employs a mechanism for attention to iteratively compute and use feature maps of varying sizes, directing the ultimate semantic segmentation process. The proposed MQA has the ability to combine multiple input and multiple scale features, and can be trained end-to-end with ground truth supervision. The proposed technique effectively leverages both low and high-resolution features and integrates a method of attention to successfully accomplish the segmentation job on ArUco markers.

4.2 Comparison of different combination MQA and FFM

To evaluate the impact of MQA and FFM on MoNetViT’s performance, a series of tests were conducted. The experiments utilized the same network backbone and implementation details to ensure consistency with previous studies. The results, as shown in the radar chart and table, compare the baseline “Backbone,” “Backbone+MQA,” “Backbone+FFM,” and “Backbone+MQA + FFM” configurations. The radar chart illustrates that the region representing “Backbone+MQA + FFM” (in red) is larger than those of other configurations, indicating superior performance across key metrics. Similarly, the table reinforces these findings, showing that “Backbone+MQA + FFM” achieves the highest Dice score (0.9731), sensitivity (0.9686), and specificity (0.9686), along with the lowest MAE (0.0049). These results suggest that integrating both MQA and FFM significantly enhances the backbone’s performance, as the areas with MQA and FFM have a notably larger magnitude than those without these modules.

The contributions of the Multi-Query Attention (MQA) and Feature Fusion Module (FFM) to the segmentation performance were further validated through an ablation study. Table 4 highlights the numerous enhancements achieved by integrating these modules into the baseline mannequin. Specifically, the Dice coefficient elevated from 0.9584 for the baseline to 0.9558 (+1.7%) with MQA alone and 0.9566 (+1.9%) with FFM alone. When each modules had been mixed, the Dice coefficient reached 0.9731 (+3.4%), demonstrating their synergistic impact. Furthermore, sensitivity and specificity improved from 0.9424 every within the baseline to 0.9686 with the mixed MQA and FFM setup, whereas the Mean Absolute Error (MAE) decreased from 0.0077 to 0.0049. These findings emphasize the important position of MQA in enhancing multi-scale characteristic integration and FFM in refining characteristic readability, leading to superior segmentation efficiency. This sturdy enchancment throughout key metrics underscores the effectiveness of the proposed MoNetViT structure in addressing complicated segmentation duties.

MoNetViT’s architecture demonstrates significant advantages through its dual-path encoder, which effectively balances local feature extraction using NCSs and global feature extraction via Transformers. This design allows the model to capture both fine-grained details and long-range dependencies, improving segmentation performance. Additionally, the integration of the Multi-Query Attention (MQA) module enhances multi-scale feature integration, enabling the model to better aggregate features across varying spatial scales. This contributes to improved segmentation accuracy, particularly in complex scenarios. Furthermore, MoNetViT’s lightweight design minimizes computational demands, making it highly efficient for real-time applications without sacrificing performance. Compared to models like DeepLabV3+, which rely on more resource-intensive architectures, MoNetViT achieves a superior balance of accuracy and efficiency, reinforcing its suitability for deployment in resource-constrained environments.

While MoNetViT demonstrates strong performance, scalability to larger datasets may require optimization strategies, and its adaptability to diverse marker types needs further evaluation across different styles and conditions. Future experiments will focus on enhancing scalability and generalizability through transfer learning and broader dataset evaluations.

5 Conclusion

This study introduces a novel model called MoNetViT, which utilizes fused NCSs and transformers to create a segmentation model for ArUco marker-infested regions. The picture features are extracted simultaneously utilizing Convolutional Neural Networks (NCSs) and transformers, resulting in a reduction in computing burden and model complexity, while enhancing the segmentation performance. Furthermore, this work introduces the multi-query attention (MQA) module as a means to enhance performance. The empirical findings demonstrate that MoNetViT outperforms the other approaches on the ArUco dataset. Further studies will concentrate on the influence of the merging of each model and on discovering strategies for decreasing the level of detail within the model. Future research will focus on enhancing the capabilities of MoNetViT to achieve even more robust outcomes. This includes exploring additional fusion methods to further optimize feature integration and segmentation accuracy. Another key direction is adapting the model for outdoor environments by incorporating GPS data, thereby extending its applicability to diverse navigation scenarios. Leveraging transfer learning techniques will also be prioritized to reduce training times and improve scalability, enabling the model to handle larger and more diverse datasets effectively. These advancements aim to broaden the utility and efficiency of MoNetViT, ensuring its suitability for a wide range of real-world applications.

Statements

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/supplementary material.

Author contributions

LT: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing. RG: Conceptualization, Formal analysis, Funding acquisition, Investigation, Supervision, Validation, Writing – review & editing. Prayitno: Conceptualization, Formal analysis, Funding acquisition, Investigation, Supervision, Validation, Writing – review & editing, Data curation.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Acknowledgments

The authors thank the Ministry of Education, Culture, Research, and Technology for the research grant and support provided for this doctoral dissertation. Additionally, we extend our thanks to Diponegoro University, especially the Doctoral Program of Information Systems, for their continuous support and valuable resources that have significantly contributed to the success of this research.

Conflict of curiosity

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI assertion

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s observe

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

AbozeidA.TalobaA.Faiz AlwaghidA.SalemM.ElhadadA. (2023). An efficient indoor localization based on deep attention learning model. Comput. Syst. Sci. Eng.46, 2637–2650. doi: 10.32604/csse.2023.037761
Asadi ShamsabadiE.XuC.RaoA. S.NguyenT.NgoT.Dias-da-CostaD. (2022). Vision transformer-based autonomous crack detection on asphalt and concrete surfaces. Autom. Constr.140:104316. doi: 10.1016/j.autcon.2022.104316
BaiJ.LiuZ.LinY.LiY.LianS.LiuD. (2019). Wearable travel aid for environment perception and navigation of visually impaired people. Electronics8:697. doi: 10.3390/electronics8060697
BenmounaB.PourdarbaniR.SabziS.Fernandez-BeltranR.García-MateosG.Molina-MartínezJ. M. (2023). Attention mechanisms in convolutional neural networks for nitrogen treatment detection in tomato leaves using hyperspectral images. Electronics12:22706. doi: 10.3390/electronics12122706
ChenJ.LuY.YuQ.LuoX.AdeliE.WangY.et al. (2021). TransUNet: transformers make strong encoders for medical image segmentation, arXiv. 1–13. doi: 10.48550/arXiv.2102.04306
ChenY.WangT.TangH.ZhaoL.ZhangX.TanT.et al. (2023). CoTrFuse: a novel framework by fusing NCS and transformer for medical image segmentation. Phys. Med.68:175027. doi: 10.1088/1361-6560/acede8
d’AscoliS.TouvronH.LeavittM. L.MorcosA. S.BiroliG.SagunL. (2022). ConViT: improving vision transformers with soft convolutional inductive biases. J. Stat. Mech. Theor. Exp.2022, 2286–2296. doi: 10.1088/1742-5468/ac9830
DoppalapudiS. A. I. K. (2023). Semantic image segmentation using transformers.
DosovitskiyA. (2021). “An image is WORTH 16X16 WORDS: transformers for image recognition at scale,” in ICLR 2021 – 9th International Conference on Learning Representations.
DosovitskiyA.BeyerL.KolesnikovA.WeissenbornD.ZhaiXUnterthinerT.et al. (2021). “An image is Worth 16×16 Words: transformers for image recognition at scale,” arXiv. doi: 10.48550/arXiv.2010.11929
El-taherF.TahaA.CourtneyJ.MckeeverS. (2021). A systematic review of urban navigation systems for visually impaired people. Sensors21:3103. doi: 10.3390/s21093103
FanD. P.TaoZ.Ge-PengJ.YiZ.GengC.HuazhuF.et al. (2018). Enhanced-alignment measure for binary foreground map evaluation. IJCAI.12, 698–704. doi: 10.24963/ijcai.2018/97
FanD.-P.ZhouT.JiG. P.ZhouY.ChenG.FuH.et al. (2020). Inf-net: automatic COVID-19 lung infection segmentation from CT images. IEEE Trans. Med. Imaging39, 2626–2637. doi: 10.1109/tmi.2020.2996645
FernandoN.McMeekinD. A.MurrayI. (2023). ‘Route planning methods in indoor navigation tools for vision impaired persons: a systematic review’, disability and rehabilitation. Assist. Technol.18, 763–782. doi: 10.1080/17483107.2021.1922522
HuangQ.LeiY.XingW.HeC.WeiG.MiaoZ.et al. (2022). Evaluation of pulmonary edema using ultrasound imaging in patients with COVID-19 pneumonia based on a Non-Local Channel attention ResNet. Ultrasound in Medicine & Biology [Preprint].48, 945–953. doi: 10.1016/j.ultrasmedbio.2022.01.023
HuX.YangK.FeiL.WangK. (2019). ACNET: attention based network to exploit complementary features for RGBD semantic segmentation. IEEE21, 1440–1444. doi: 10.1109/icip.2019.8803025
JeamwatthanachaiW.WaldM.WillsG. (2019). Indoor navigation by blind people: behaviors and challenges in unfamiliar spaces and buildings. Br. J. Vis. Impair.37, 140–153. doi: 10.1177/0264619619833723
KarimiD.DouH.WarfieldS. K.GholipourA. (2020). Deep learning with Noisy labels: exploring techniques and remedies in medical image analysis. Med. Image Anal.65:101759. doi: 10.1016/j.media.2020.101759
KimH. W.LeeS.YangJ. H.MoonY.LeeJ.MoonW. J. (2023). Cortical Iron accumulation as an imaging marker for neurodegeneration in clinical cognitive impairment Spectrum: a quantitative susceptibility mapping study. Korean J. Radiol.24, 1131–1141. doi: 10.3348/kjr.2023.0490
KubotaM. (2024). Snap: smartphone-based indoor navigation system for blind people via floor map analysis and intersection detection. Proc. ACM Hum. Comput. Int.8, 1–22. doi: 10.1145/3676522
KuriakoseB.ShresthaR.SandnesF. E. (2021). Towards independent navigation with visual impairment: a prototype of a deep learning and smartphone-based assistant. Association for Computing Machinery.113–114. doi: 10.1145/3453892.3464946
KwakJ.SungY. (2021). DeepLabV3-refiner-based semantic segmentation model for dense 3D point clouds. Remote Sens.13:165. doi: 10.3390/rs13081565
LazebnikS.SchmidC.PonceJ. (2006). “Beyond bags of features: spatial pyramid matching for recognizing natural scene categories,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), 2169–2178.
LeeJ.YoonW.KimS.KimD.KimS.SoC. H.et al. (2019). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics36, 1234–1240. doi: 10.1093/bioinformatics/btz682
LiuY.ZhangZ.LiuX.LeiW.XiaX. (2021). Deep learning based mineral image classification combined with visual attention mechanism. IEEE Access9, 98091–98109. doi: 10.1109/access.2021.3095368
Martínez-CruzS.et al. (2021). An outdoor navigation assistance system for visually impaired people in public transportation. IEEE Access9, 130767–130777. doi: 10.1109/access.2021.3111544
MehtaS.AppleM. R. (2022). MobileViT: light-weight, general-purpose, and Mobile-friendly vision transformer. ICLR22:3. doi: 10.48550/arXiv.2110.02178
MeiY.FanY.ZhouY.HuangL.HuangT. S.ShiH. (2020). Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining. arXiv. doi: 10.48550/arxiv.2006.01424
MessaoudiM. D.MenelasB. A. J.McheickH. (2020). Autonomous smart white cane navigation system for indoor usage. Technologies8:37. doi: 10.3390/technologies8030037
MisawaN.YamaguchiR.YamadaA.WangT.MatsuiC.TakeuchiK. (2024). Design methodology of compact edge vision transformer CiM considering non-volatile memory bit precision and memory error tolerance. Jap. J. Appl. Phys.63:03SP05. doi: 10.35848/1347-4065/ad1bbd
MujtabaG.MalikA.RyuE. (2022). LTC-SUM: lightweight client-driven personalized video summarization framework using 2D NCS. IEEE access10, 103041–103055. doi: 10.1109/access.2022.3209275
PetitO.ThomeN.RambourC.ThemyrL.CollinsT.SolerL. (2021). “U-net transformer: self and cross attention for medical image segmentation” in Machine learning in medical imaging. eds. LianC.et al. (Cham: Springer International Publishing), 267–276.
QiM.ZhangH. (2023). Dimensional emotion recognition based on two stream NCS fusion attention mechanism. Third International Conference on Sensors and Information Technology (ICSI 2023). (Eds.). KannanH.HemanthJ.. International Society for Optics and Photonics.
RonnebergerO.FischerP.BroxT. (2015). U-net: convolutional networks for biomedical image segmentation. Cham: Springer.
SandlerM.HowardA.ZhuM.ZhmoginovA.ChenL.-C. (2018). “MobileNetV2: inverted residuals and linear bottlenecks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4510–4520.
ShahP. A.HessahA. R. A. A.RayanH. M. A. A. (2023). Machine learning-based smart assistance system for the visually impaired. 139–144. doi: 10.1109/ITT59889.2023.10184257
SunS.YueX.ZhaoH.TorrP. H. S.BaiS. (2022). ‘Patch-based separable transformer for visual recognition. IEEE transactions on pattern analysis and machine intelligence22, 1–8. doi: 10.1109/tpami.2022.3231725
TaoY.GanzA. (2020). Simulation framework for evaluation of indoor navigation systems. IEEE access8, 20028–20042. doi: 10.1109/ACCESS.2020.2968435
ThakurP. S.ChaturvediS.KhannaP.SheoreyT.OjhaA. (2023). Vision transformer meets convolutional neural network for plant disease classification. Eco. Inform.77:102245. doi: 10.1016/j.ecoinf.2023.102245
TheodorouP.TsiligkosK.MelionesA.TsigrisA. (2022). An extended usability and UX evaluation of a Mobile application for the navigation of individuals with blindness and visual impairments indoors: an evaluation approach combined with training sessions. Br. J. Vis. Impair.42, 86–123. doi: 10.1177/02646196221131739
VaswaniA.ShazeerN.ParmarN.UszkoreitJ.JonesL.GomezA. N.et al. (2023). Attention is all youneed. arXiv. doi: 10.48550/arXiv.1706.03762
WangX.GirshickR.GuptaA.HeK. (2018). “Non-local neural networks,”in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7794–7803.
WangY.YiB.YuK. (2022). Exploration and research about key technologies concerning deep learning models targeting Mobile terminals. J. Phys. Conf. Series2303:012086. doi: 10.1088/1742-6596/2303/1/012086
WangZ.JianboW.PengfeiM.FuyongW.WenbaoD. (2019). “Evaluation for parachute reliability based on fiducial inference and Bayesian network,” in Proceedings of the 2018 International Conference on Mathematics, Modeling, Simulation and Statistics Application (MMSSA 2018). London: Atlantis Press.
WenF.WangM.HuX. (2023). DFAM-DETR: deformable feature based attention mechanism DETR on slender object detection. IEICE Trans. Inf. Syst.E106, 401–409. doi: 10.1587/transinf.2022edp7111
XiaZ.KimJ. (2023). Enhancing mask transformer with auxiliary convolution layers for semantic segmentation. Sensors23:20581. doi: 10.3390/s23020581
YeungM.SalaE.SchönliebC. B.RundoL. (2022). Unified focal loss: generalising dice and cross entropy-based losses to handle class imbalanced medical image segmentation. Comput. Med. Imaging Graph.95:102026. doi: 10.1016/j.compmedimag.2021.102026
ZhangY.LiuH.HuQ. (2021) ‘TransFuse: fusing transformers and NCSs for medical image segmentation’, in BruijneM.deMedical Image Computing and Computer Assisted Intervention — MICCAI 2021. Cham: Springer International Publishing, 14–24
ZhangZ.GaoQ.LiuL.HeY. (2023). A high-quality Rice leaf disease image data augmentation method based on a dual GAN. IEEE access11, 21176–21191. doi: 10.1109/access.2023.3251098
ZhouC.ZhangX.ZhongY. (2024). Are transformers more suitable for plant disease identification than convolutional neural networks?. doi: 10.21203/rs.3.rs-4284240/v1
ZhouL.ZhuM.XiongD.OuyangL.OuyangY.ZhangX. (2023). MR image reconstruction via non-local attention networks. Fourteenth International Conference on Graphics and Image Processing (ICGIP 2022). (Eds.) XiaoL.XueJ. International Society for Optics and Photonics.
ZhuL.KangZ.ZhouM.YangX.WangZ.CaoZ.et al. (2022). CMANet: cross-modality attention network for indoor-scene semantic segmentation. Sensors22:8520. doi: 10.3390/s22218520
ZhuZMengdeX.SongB.TengtengH.XiangB. (2019). “Asymmetric non-local neural networks for semantic segmentation,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 593–602.

Summary

Keywords

indoor navigation, computer vision, markers, assistive technology, mobile devices

Citation

Triyono L, Gernowo R and Prayitno (2025) MoNetViT: an efficient fusion of NCS and transformer technologies for visual navigation assistance with multi query attention. Front. Comput. Sci. 7:1510252. doi: 10.3389/fcomp.2025.1510252

Published

10 February 2025

Edited by

Sokratis Makrogiannis, Delaware State University, United States

Reviewed by

Karisma Putra, Muhammadiyah University of Yogyakarta, Indonesia

Mosiur Rahaman, Asia University, Taiwan

Updates

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or replica in different boards is permitted, supplied the unique creator(s) and the copyright proprietor(s) are credited and that the unique publication on this journal is cited, in accordance with accepted tutorial follow. No use, distribution or replica is permitted which doesn’t comply with these phrases.

*Correspondence: Liliek Triyono, [email protected]

Disclaimer

All claims expressed on this article are solely these of the authors and don’t essentially signify these of their affiliated organizations, or these of the writer, the editors and the reviewers. Any product which may be evaluated on this article or declare which may be made by its producer will not be assured or endorsed by the writer.

Sources

ByNews Center

Abstract

1 Introduction

2 Methods

2.1 Transformer and NCSs

2.2 Image segmentation using transformer and NCSs

2.3 Lightweight networks

2.4 Mini-MobileNet-MobileViT network

2.4.1 Encoder sub-network

2.4.2 Decoder sub-network

2.4.3 MobileViT block

2.5 Multi query attention

2.6 Motivation

2.7 Loss function

2.8 Experiment setup

3 Results

3.1 Three-class ArUco marker labeling results

3.2 Component impact analysis

3.3 Parameter comparison

4 Discussion

4.1 Comparison of multi-scale features

4.2 Comparison of different combination MQA and FFM

5 Conclusion

Statements

Data availability statement

Author contributions

Funding

Acknowledgments

Conflict of curiosity

Generative AI assertion

Publisher’s observe

References

Summary

Share this:

By News Center

Related Post

Leave a Reply Cancel reply

You missed