ECoSS (CINEA/2022/OP/0019) is an open library comprising a curated, continuously growing Digital Catalogue of individual sound signatures from the shallow-water marine underwater soundscape, together with an AI Classifier for identifying them.
Why is it important to have a catalogue of sound signatures? Because it provides a way to assess the acoustic impact of human activities, enabling the identification of measures to manage those activities and reduce potential harm to marine life.
A Sound Signature is a unique acoustic pattern that can be used to identify and distinguish a particular sound source. It can be generated by analysing the frequency, amplitude, and temporal characteristics of a sound and comparing it to a known reference.
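As a purely illustrative example (not the ECoSS processing chain), the sketch below derives a simple signature from a clip's log-mel spectrogram and compares it to a known reference with cosine similarity; the `sound_signature` helper, the use of librosa, and the placeholder audio are assumptions made for the example.

```python
import numpy as np
import librosa  # assumed available; any STFT/mel implementation would do

def sound_signature(waveform, sr, n_mels=64):
    """Summarize a clip's spectral and temporal content as a fixed-length vector."""
    # Log-mel spectrogram captures frequency and amplitude over time.
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    # Mean and standard deviation over time give a crude temporal summary.
    return np.concatenate([log_mel.mean(axis=1), log_mel.std(axis=1)])

def similarity(sig_a, sig_b):
    """Cosine similarity between two signatures (1.0 = identical pattern)."""
    return float(np.dot(sig_a, sig_b) / (np.linalg.norm(sig_a) * np.linalg.norm(sig_b)))

# Compare an unknown clip against a known reference (placeholder audio).
sr = 16000
unknown = np.random.randn(sr * 3).astype(np.float32)
reference = np.random.randn(sr * 3).astype(np.float32)
print(similarity(sound_signature(unknown, sr), sound_signature(reference, sr)))
```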
A Digital Catalogue of underwater sounds stores recordings used to study marine environments, human impact, and conservation. Created with recording devices and digital processing, it is a curated and growing resource for research and conservation.
In the ECoSS project, the Digital Catalogue includes three main databases:
● a DB with sound signatures curated and validated by experts;
● a DB with sound signatures optimized for AI training;
● a DB with the training and test sets used to train and evaluate the developed AI models.
The system is open to contributions: the Contribute feature allows users to enrich the Catalogue of Underwater Sound Signatures.
AI-supported classification of underwater sound signatures is accessible through the Classifier. This tool supports tagging unknown sounds and improves the accuracy of future classifications through collaborative contributions.
ECoSS employs three advanced AI architectures, all trained on the same benchmark dataset, AudioSet, to ensure consistency and comparability of performance:
● EffAT
The EfficientAT architecture, developed by Florian Schmid and Khaled Koutini, is designed to combine the accuracy of Transformers with the efficiency of CNNs for audio classification tasks. To achieve this, the architecture was trained on the large-scale AudioSet dataset, which includes over 2 million audio clips spanning 527 sound classes. The key to EfficientAT's performance lies in Knowledge Distillation, a technique where a complex, high-performing model (in this case PaSST, a Transformer-based model) acts as a "teacher" to guide the learning of a smaller, more efficient "student" model built on MobileNet, a CNN architecture. PaSST was first trained on AudioSet to achieve a high level of accuracy. Then, through Knowledge Distillation, the student MobileNet model learned to replicate the Transformer's behavior by using the teacher's outputs (soft labels) rather than the ground-truth labels alone. This process allowed EfficientAT to inherit the Transformer's ability to capture complex relationships in the data while reducing computational complexity and resource usage.
EfficientAT also incorporates Transformer-like mechanisms, such as Dynamic ReLU, Dynamic Convolutions, and Coordinate Attention, into the CNN structure. This combination enables the CNN to replicate some of the advanced behavior typically found in Transformer models, such as focusing on the most relevant parts of the input data. As a result, EfficientAT models retain the high accuracy of Transformers while operating with the efficiency of CNNs, making them suitable for deployment in resource-limited environments, such as mobile and edge devices, where computational power and memory are constrained. This hybrid approach delivers a powerful solution for audio classification while optimizing for scalability and real-world application needs.
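As an illustration of the Knowledge Distillation idea described above (a minimal sketch, not the actual EfficientAT training code), the following PyTorch snippet blends a hard-label loss with a soft-label loss from a frozen teacher; the temperature `T`, weight `alpha`, and the placeholder linear models are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend the hard-label loss with a soft-label loss from the teacher.

    T (temperature) softens the teacher's output distribution; alpha balances
    the two terms. Both values here are illustrative defaults.
    """
    # Standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    # KL divergence between softened student and teacher distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft

# Toy usage: a frozen "teacher" (stand-in for PaSST) guides a small "student"
# (stand-in for a MobileNet-style CNN) on random feature vectors.
teacher = torch.nn.Linear(128, 10).eval()   # placeholder models
student = torch.nn.Linear(128, 10)
x = torch.randn(8, 128)
targets = torch.randint(0, 10, (8,))
with torch.no_grad():
    t_logits = teacher(x)
loss = distillation_loss(student(x), t_logits, targets)
loss.backward()
```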
● VGGish
VGGish is a convolutional neural network (CNN) model trained by Google researchers for feature extraction from audio content. The model is based on the popular VGGNet architecture. VGGish converts audio input features into a semantically meaningful, high-level 128-dimensional embedding that can be fed as input to a downstream classification model. The downstream model can be shallower than usual because the VGGish embedding is more semantically compact than the original raw audio features. VGGish embeddings have proved to be an effective data-driven alternative to well-established audio features such as the Mel-Frequency Cepstral Coefficients (MFCC). With respect to the original VGG, the VGGish architecture consists of only 8 convolutional layers and a linear mapping layer at the end. Each convolutional block comprises convolution, pooling, and activation layers, with 3x3 convolution filters trained to extract various features from the input audio spectrograms. The input size was changed to 96x64 to accommodate log-mel spectrogram audio inputs, and the number of fully connected output nodes was reduced to 128 to obtain a more compact embedding of each audio clip. The final 128-dimensional embeddings were post-processed before release by applying a PCA transformation (including whitening) as well as quantization to 8 bits per embedding element. To deal with marine data, a Support Vector Machine (SVM) classifier has been built and trained on top of the VGGish embeddings, and its performance will be shown in the next section of this document.
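As a hedged sketch of how such a downstream SVM could be trained on precomputed 128-D embeddings (not the project's actual training pipeline), the example below uses scikit-learn on synthetic data; the kernel choice and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder for precomputed VGGish embeddings: one 128-D vector per 0.96 s
# block, with a class label per block (e.g. vessel, cetacean, ambient).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 128)).astype(np.float32)   # synthetic embeddings
y = rng.integers(0, 3, size=600)                     # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# An RBF-kernel SVM is a common default; kernel and C are illustrative choices,
# not the project's published configuration.
clf = SVC(kernel="rbf", C=1.0, probability=True)
clf.fit(X_train, y_train)
print("block-level accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```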
Given an audio clip of arbitrary duration, VGGish processes it by splitting it into non-overlapping 0.96 s blocks and producing 128 numbers for each one. The typical choice is to classify each block independently and then add a data-fusion stage if a single class is required as the final output. The VGGish model is available in TensorFlow and PyTorch and can easily be imported and integrated through a simple Python interface.
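The block-and-fuse strategy can be sketched as follows; `classify_block` is a placeholder standing in for the VGGish embedding plus the downstream classifier, the 16 kHz sample rate is VGGish's expected input rate, and the audio is synthetic.

```python
import numpy as np
from collections import Counter

SAMPLE_RATE = 16000                      # VGGish expects 16 kHz mono audio
BLOCK = int(0.96 * SAMPLE_RATE)          # 0.96 s -> 15360 samples per block

def classify_block(block):
    """Placeholder for VGGish embedding + downstream classifier.

    In a real pipeline this would run the 0.96 s block through VGGish to get a
    128-D embedding and then through a trained classifier (e.g. the SVM above).
    Here it returns a pseudo-random label so the fusion logic can be shown.
    """
    seed = abs(hash(block.tobytes())) % 2**32
    return int(np.random.default_rng(seed).integers(0, 3))

def classify_clip(waveform):
    """Split a clip into non-overlapping 0.96 s blocks, classify each block,
    then fuse the block-level decisions by majority vote."""
    n_blocks = len(waveform) // BLOCK
    labels = [classify_block(waveform[i * BLOCK:(i + 1) * BLOCK]) for i in range(n_blocks)]
    return Counter(labels).most_common(1)[0][0], labels

clip = np.random.randn(SAMPLE_RATE * 5).astype(np.float32)   # placeholder 5 s clip
final_label, per_block = classify_clip(clip)
print(per_block, "->", final_label)
```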
● PaSST
PaSST (Efficient Training of Audio Transformers with Patchout) is an improved version of AST (Audio Spectrogram Transformer) by Yuan Gong, Yu-An Chung, and James Glass, the first convolution-free, purely attention-based model for audio classification. The key improvement of PaSST is the concept of Patchout, which consists in dropping parts of the input sequence during training, encouraging the transformer to perform the classification from an incomplete sequence.
The pipeline begins with an audio spectrogram being fed into the model as input. Patch extraction and linear projection follow. Then, frequency and time positional encodings are added. Next, Patchout is applied and the classification token is added. The sequence is then flattened and passed through d layers of self-attention blocks (d is the depth of the transformer). Finally, a classifier operates on the transformed classification token. During the Patchout step, parts of the sequence are randomly dropped while training, reducing the sequence length and effectively regularizing the training process. During inference, the whole input sequence is presented to the transformer.
The authors of PaSST initially used pretrained weights from a base model trained on the ImageNet dataset. Leveraging these pretrained weights, they trained the PaSST model on the AudioSet dataset. We use these weights as a starting point to fine-tune the model on our database. Although the model was pretrained on 10-second segments, we simply crop the time positional encoding parameters, allowing us to take advantage of the knowledge learned from AudioSet.
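To illustrate the Patchout idea (a generic sketch, not the official PaSST implementation), the snippet below drops whole frequency rows and time columns from a patch grid during training while keeping the full sequence at inference; the drop counts and tensor sizes are illustrative assumptions.

```python
import torch

def patchout(patches, n_freq_drop=2, n_time_drop=10, training=True):
    """Structured Patchout sketch: during training, drop whole frequency rows
    and time columns from the patch grid before flattening it into a sequence.

    patches: tensor of shape (batch, n_freq_patches, n_time_patches, embed_dim).
    The drop counts are illustrative hyperparameters, not PaSST's published values.
    """
    b, f, t, d = patches.shape
    if not training:
        # Inference: keep the full sequence.
        return patches.reshape(b, f * t, d)
    keep_f = torch.randperm(f)[: f - n_freq_drop].sort().values
    keep_t = torch.randperm(t)[: t - n_time_drop].sort().values
    kept = patches[:, keep_f][:, :, keep_t]   # (batch, f', t', embed_dim)
    return kept.reshape(b, -1, d)             # shorter sequence for training

# Toy usage: a 12x100 patch grid with 192-dim patch embeddings.
x = torch.randn(4, 12, 100, 192)
print(patchout(x, training=True).shape)   # e.g. torch.Size([4, 900, 192])
print(patchout(x, training=False).shape)  # torch.Size([4, 1200, 192])
```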
The trained AI models are freely downloadable via GitHub.
Project Partners
Role | Partner | Nationality
Coordinator | CTN Centro Tecnológico Naval y del Mar | Spain
Technical Coordinator | ETT | Italy
Partner | ICES International Council for the Exploration of the Sea | Denmark
Partner | Witteveen + Bos | Netherlands
Partner | SAES Sociedad Anónima de Electrónica Submarina | Spain
Partner | SHOM Service Hydrographique et Océanographique de la Marine | France
Partner | OnAIR | Italy
Partner | SMHI Swedish Meteorological and Hydrological Institute | Sweden