Time Series Classification Using GAN-Augmented Data and Attention-Based Convolutions
Author: Ayesha Ehsan 311745
Supervisors: Prof. Dr. Dr. Lars Schmidt-Thieme
Nourhan Ahmed
Thesis submitted for
Master of Science in Data Analytics
Wirtschaftsinformatik und Maschinelles Lernen
Stiftung Universität Hildesheim
Universitätsplatz 1, 31141 Hildesheim
Statement as to the sole authorship of the thesis:
Time Series Classification Using GAN-Augmented Data and Attention-Based Convolutions. I hereby certify that the master’s thesis named above was solely written by me and that no assistance was used other than that cited. The passages in this thesis that were taken verbatim or with the same sense as that of other works have been identified in each individual case by the citation of the source or the origin, including the secondary sources used. This also applies for drawings, sketches, and illustrations, as well as internet sources and other collections of electronic texts or data, etc. The submitted thesis has not previously been used for the fulfilment of degree requirements and has not been published in English or any other language. I am aware of the fact that false declarations will be treated as fraud.
26th July 2024, Hildesheim
Abstract
This thesis explores the potential of contrastive learning for enhancing the analysis of time series data, a domain where labels are often scarce or unavailable. By exploiting the distinctions between similar and dissimilar instances, our work aims to extract significant features from the time series data without the need for labelled examples. We build upon the TS2Vec methodology, extending its potential with an advanced encoder architecture that integrates attention-based causal dilated convolutions, thereby enriching the contextual details captured within the data representations.
Considering the challenges associated with labelled data scarcity, we also examine the application of generative adversarial networks (GANs) to create synthetic data for diversity. Subsequently, our approach employs adaptive pooling to obtain representations of arbitrary sub-sequences within the time series. We validate our approach through empirical analysis, using benchmark datasets from the UCR and UEA time series classification archives. Our results demonstrate notable improvements over TS2Vec, and Synth-TS2Vec outperforms current state-of-the-art methods in unsupervised time series representation.
Contents
1 Introduction
  1.1 Research Questions
2 Background
  2.1 Convolutional Neural Network (CNN) for Time Series
    2.1.1 Convolutions
    2.1.2 Dilated Convolutions
    2.1.3 Residual Block
  2.2 Recurrent Neural Network (RNN)
  2.3 Long Short-Term Memory (LSTM)
  2.4 Transformers for Time Series
    2.4.1 Self-attention
    2.4.2 Multi-head attention
    2.4.3 Transformer Architecture
  2.5 Data Augmentation
    2.5.1 Traditional Data Augmentation
    2.5.2 Advanced Data Augmentation Techniques
3 Related Work
  3.1 Classical Methods for Learning Temporal Dynamics in Time Series
    3.1.1 Fourier Transforms
    3.1.2 ARIMA
    3.1.3 Dynamic Time Warping (DTW)
  3.2 Supervised Learning
  3.3 Unsupervised Learning
  3.4 Self-supervised Learning
    3.4.1 Contrastive Learning
4 Methodology
  4.1 TS2Vec
  4.2 Problem Statement
  4.3 Proposed Method
    4.3.1 Generator
    4.3.2 Critic (Discriminator)
    4.3.3 Training
    4.3.4 Representation Encoder
    4.3.5 Hierarchical Contrasting
5 Experiments
  5.1 Experimental Setup
    5.1.1 Datasets
    5.1.2 Data Preprocessing
    5.1.3 Baselines
    5.1.4 Evaluation Metrics
  5.2 Implementation Details
    5.2.1 Time Series Classification
  5.3 Results
    5.3.1 Visualization
    5.3.2 Training Time
  5.4 Ablation Study
6 Conclusion
A Architecture Details
  A.1 GAN Architecture
  A.2 Random Cropping
B Experimental Details
List of Figures
1.1 A general framework of time series representation learning (Trirat et al. 2024).
2.1 A general 1D convolutional neural network with two 1D convolutions (Shenfield and Howarth 2020).
2.2 Stacked dilated convolutions in CNN (Oord et al. 2016).
2.3 A residual block (He et al. 2016).
2.4 RNN architecture (Dancker 2022).
2.5 A basic LSTM architecture (Ingolfsson 2021).
2.6 Self-attention mechanism (left) and multi-head attention mechanism (right) (Vaswani et al. 2017).
2.7 Transformer architecture (Vaswani et al. 2017).
2.8 An example of augmentations applied on input data to generate new samples (Eldele, Ragab, Z. Chen, Wu, C.-K. Kwoh et al. 2023).
2.9 GAN architecture (Brophy et al. 2023).
3.1 Self-supervised contrastive learning (Witter 2023).
4.1 Architecture of TS2Vec (Yue et al. 2022).
4.2 Strategies for selection of positive pairs (Yue et al. 2022).
4.3 Overall architecture of Synth-TS2Vec. The model consists of two parts: (1) a Generative Adversarial Network to generate synthetic samples, (2) an encoder that learns representations of input time series instances through hierarchical contrastive loss.
5.1 PCA plots of 5 UCR datasets on 50 epochs. Red denotes original and blue denotes synthetic data.
5.2 PCA plots of 5 UEA datasets on 50 epochs. Red denotes original and blue denotes synthetic data.
5.3 t-SNE plots of learned embeddings on the top 6 UCR datasets with the most test samples. Each class is represented by a different color.
5.4 PCA plots of synthetic samples on the ElectricDevices dataset from the UCR archive.
A.1 Random Cropping to create new contexts.
B.1 Timestamp masking in Synth-TS2Vec.
List of Tables
5.1 Full Results on first 125 UCR datasets
5.2 Full Results on first 29 UEA datasets
5.3 Execution time of TS2Vec vs Synth-TS2Vec
5.4 Ablation Results on first 125 UCR datasets
5.5 Ablation Results on first 29 UEA datasets
A.1 Generator Architecture
A.2 Discriminator (Critic) Architecture
B.1 A Summary of the 30 UEA Multivariate datasets (Bagnall et al. 2018)
B.2 A summary of the 128 UCR Univariate datasets (Dau et al. 2018)
B.3 TS2Vec vs Synth-TS2Vec
B.4 Default settings of lstmwgan-gp
B.5 Default settings of Synth-TS2Vec
B.6 Results of TS2Vec vs Synth-TS2Vec
Chapter 1
Introduction
Time series data is used across research domains ranging from financial markets (Sezer, Gudelek and Ozbayoglu 2020) and the Internet of Things (IoT) (Cook, Misirli and Fan 2019) to human activity recognition on wearable devices (K. Chen et al. 2021; Gu et al. 2021) and healthcare services (Sun et al. 2020). Time series play an important role in data-driven decision-making and predictions. However, real-world data is often high-dimensional with intricate patterns, which makes it challenging to analyse (Salinas, Flunkert, Gasthaus and Januschowski 2020).
Furthermore, the performance of deep learning frameworks, such as supervised models, largely depends on labelled datasets, which are difficult to obtain. Time series data is more challenging to label than image data, as it contains noise and sparse patterns. In other words, the unavailability of labelled data poses a major constraint to effectively applying deep learning techniques (Trirat et al. 2024). Therefore, to leverage deep learning’s full potential in time-series analysis effectively, it is necessary to generate precise and comprehensive labels (Eldele, Ragab, Z. Chen, Wu, C. K. Kwoh et al. 2021).
Representation learning has recently become an essential technique to extract complex patterns from raw time series without the need for manual feature engineering (Tonekaboni, Eytan and Goldenberg 2021; Trirat et al. 2024). These learned representations underpin various tasks like forecasting (Lim and Zohren 2021), classification (Ruiz et al. 2021), and anomaly detection (Choi et al. 2021). Recent self-supervised models have excelled at extracting latent representations from unlabeled data via pretext tasks (Eldele, Ragab, Z. Chen, Wu, C. K. Kwoh et al. 2021; Yue et al. 2022). Previous studies have employed various data augmentation techniques to acquire
Figure 1.1: A general framework of time series representation learning (Trirat et al. 2024).
robust and generalizable representations, which are crucial for downstream tasks (Eldele, Ragab, Z. Chen, Wu, C. K. Kwoh et al. 2021; Tonekaboni, Eytan and Goldenberg 2021). The core concept behind data augmentation is to generate synthetic datasets that cover unexplored regions of the input space while preserving accurate information (Wen et al. 2020). Data augmentation has demonstrated efficacy across diverse domains such as computer vision (CV) (Donahue and Simonyan 2019; Radford, Metz and Chintala 2015) and natural language processing (NLP) (Young et al. 2018; Radford, Jozefowicz and Sutskever 2018). However, there has been relatively less emphasis on developing improved data augmentation techniques tailored specifically for time series data (Wen et al. 2020; Eldele, Ragab, Z. Chen, Wu, C. K. Kwoh et al. 2021). Figure 1.1 illustrates the framework of representation learning for time series data.
Recent advancements in self-supervised learning methods, such as contrastive learning, either capture representations at specific granularities or rely on heuristic-based data augmentation techniques that may potentially disrupt the inherent temporal dependencies embedded within the data (Meng et al. 2023; Trirat et al. 2024). (Tonekaboni, Eytan and Goldenberg 2021) introduces a novel framework for contrastive learning tailored for intricate multivariate non-stationary time series, aiming to capture patterns at the timestamp level. Further, T-Loss (Franceschi, Dieuleveut and Jaggi 2019) uses a triplet loss to learn scalable time series representations, while TS-TCC (Eldele, Ragab, Z. Chen, Wu, C. K. Kwoh et al. 2021) engages in cross-view temporal comparison using augmented data pairs. TimeCLR (X. Yang, Z. Zhang and R. Cui 2022) is another contrastive learning framework that incorporates a feature extractor designed to learn invariant representations by encouraging agreement between two distinct views of an input sample.
In this thesis, we build upon the existing work of “TS2Vec: Towards Universal Representation of Time Series” (Yue et al. 2022) to improve the
generalisation of time series representation learning. We have a two-fold primary goal: to expand the diversity of datasets and to enhance the performance of our model in classification tasks. One of our noteworthy achievements is integrating self-attention and using a GAN for data augmentation, resulting in enhanced model performance. We carry out comprehensive experiments to validate the components used in our model for classification tasks. As a result, our research contributes a robust method to detect complex patterns in time series data and sets a new benchmark for future research.
The remainder of the thesis is structured into the following chapters: Chapter 2 gives the essential background information, which lays the foundation for the theoretical and contextual aspects of our research. Chapter 3 offers a comprehensive review of the relevant literature. In Chapter 4, we present the baseline model architecture and its building blocks, along with our proposed framework. Chapter 5 is dedicated to providing comprehensive details regarding our experimental setup and showing that Synth-TS2Vec outperforms state-of-the-art methods on the UCR and UEA archives. We conclude our thesis in Chapter 6 and discuss future work on representation learning for time series classification.
1.1 Research Questions
We explore the following research questions in this thesis:
RQ1: How does combining attention mechanisms with dilated convolutions in our model improve its performance and facilitate its ability to capture and prioritize critical temporal patterns in time series data?
RQ2: What is the impact of applying Generative Adversarial Networks as an advanced data augmentation technique on the model’s ability to generalize across diverse time series data distributions, particularly in relation to the quality of generated synthetic data and its influence on model robustness?
RQ3: Does the proposed framework align with or exceed the performance benchmarks of current state-of-the-art time series representation learning models?
The aforementioned research questions form the basis for investigating and enhancing the existing representation learning model. By addressing these questions, our goal is to advance the state-of-the-art in time series analysis
and contribute to the development of more robust models for various real-world applications.
Chapter 2
Background
2.1 Convolutional Neural Network (CNN) for Time Series
Convolutional neural networks (CNNs) are deep learning techniques that have been applied to computer vision (Sharma, Jain and Mishra 2018; Galvez et al. 2018) and natural language processing tasks (Hughes et al. 2017; Ouyang et al. 2015). Lately, CNNs have demonstrated significant performance improvements in handling time series data due to their ability to capture temporal dependencies and extract meaningful features directly from raw data by applying convolutions across the temporal dimension (Bai, Kolter and Koltun 2018; Ismail Fawaz, Forestier et al. 2019). Temporal convolutional networks (TCNs), introduced by (Bai, Kolter and Koltun 2018), utilize causal dilated convolutions to capture long-range dependencies. The study by Wang (Zhiguang Wang, Yan and Oates 2017) presented a fully convolutional network (FCN), which acts as a feature extractor for time series classification. Other convolutional designs for TSC include InceptionTime (Ismail Fawaz, Lucas et al. 2020), which is inspired by the Inception architecture (Szegedy et al. 2017) and relies on an ensemble of five deep CNNs. In addition, ROCKET (RandOm Convolutional KErnel Transform) (Dempster, Petitjean and Webb 2020) applies random convolutional kernels for efficient feature extraction and uses a linear classifier for prediction. For time series forecasting, WaveNet (Oord et al. 2016) employs a CNN architecture that exhibits robust performance. (Zeng et al. 2023) combined CNNs with transformer architectures to capture both local and long-range dependencies in financial time series forecasting.
2.1.1 Convolutions
Convolutional neural networks (CNNs) analyze time series data using one-dimensional (1D) convolution filters. These filters move along the input sequence’s temporal dimension, capturing patterns in feature maps and extracting particular temporal properties. In time series, 1D convolution filters are uniquely effective, as opposed to the 2D filters used in the image processing domain. CNNs abstract complex temporal features through the sequential layering of many convolutional layers, leading to a final layer that converts feature-rich, compressed data into a classification or forecast (Ismail Fawaz, Forestier et al. 2019).
Figure 2.1: A general 1D convolutional neural network with two 1D convo- lutions (Shenfield and Howarth 2020).
2.1.2 Dilated Convolutions
Dilated convolutions, used here in their causal form (causal dilated convolutions), provide an exponential increase in the receptive field, enabling networks to capture broader contextual information on a global scale. Dilated convolutions apply a kernel to the input at predetermined intervals, as opposed to regular convolutions, which traverse continuously across the feature map. This flexibility improves the ability of networks to capture complex patterns and multi-scale information while preserving computational efficiency, since it lets them reach large receptive fields without significantly increasing model parameters (Oord et al. 2016).
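To make this concrete, the following is a minimal PyTorch sketch of a causal dilated 1D convolution stack (an illustrative example, not the encoder used in this thesis; the class name and layer sizes are our own):

```python
import torch
import torch.nn as nn

class CausalDilatedConv1d(nn.Module):
    """1D convolution with left-only (causal) padding and a dilation factor."""
    def __init__(self, in_channels, out_channels, kernel_size=3, dilation=1):
        super().__init__()
        # Pad only on the left so the output at time t never sees inputs after t.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.left_pad, 0))
        return self.conv(x)

# Stacking layers with dilations 1, 2, 4, 8 grows the receptive field exponentially
# while the number of parameters per layer stays constant.
stack = nn.Sequential(*[CausalDilatedConv1d(16, 16, dilation=2 ** i) for i in range(4)])
out = stack(torch.randn(8, 16, 128))               # receptive field: 1 + 2*(1+2+4+8) = 31 steps
```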
Figure 2.2: Stacked dilated convolutions in CNN (Oord et al. 2016).
2.1.3 Residual Block
Residual networks (ResNets) represent a significant advancement in convolutional neural network (CNN) architectures. The architecture proposed by (He et al. 2016) aims to address the difficulties encountered during the training of very deep neural networks. A core element of this architecture, as shown in Figure 2.3, is the residual block. Deep networks often suffer from vanishing gradients, a problem where gradients become smaller as they propagate back through the layers, which degrades a model’s accuracy. To handle this, ResNets use ‘residual connections’, also known as ‘skip connections’, which allow a network to focus on learning ‘residual mappings’ rather than learning the full target mappings directly (He et al. 2016). These residual connections enhance the backward flow of gradients and address the vanishing gradient issue, which enables the successful training of much deeper networks.
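As an illustration, a 1D residual block (adapted to the time series setting of this thesis; the original blocks of (He et al. 2016) use 2D convolutions and batch normalization) could look as follows:

```python
import torch.nn as nn

class ResidualBlock1d(nn.Module):
    """y = activation(F(x) + x): the convolutional body F learns only the residual mapping."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # The skip connection lets gradients flow directly to earlier layers.
        return self.act(self.body(x) + x)
```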
Figure 2.3: A residual Block (He et al. 2016) .
2.2 Recurrent Neural Network (RNN)
Recurrent Neural Networks (RNNs) are established deep learning techniques for modelling sequential data. Unlike traditional neural networks, which process inputs independently, RNNs keep a memory of past information, enabling them to establish a temporal link between observations. This quality is beneficial in domains such as natural language processing (NLP), time series, and speech recognition, as it addresses the lack of memory in feed-forward networks by adding a dependency between previous and current observations (Dancker 2022).
Figure 2.4: RNN architecture (Dancker 2022).
In an RNN, connections between units form a directed cycle, resulting in an internal state that stores temporal data. This architecture enables the network to display dynamic temporal behaviour by incorporating its own output from past steps as new input for the current step. At each time step t, an RNN receives a new input xt, combines it with the previous hidden state ht−1, and computes the current hidden state ht (Poudel 2023). The hidden state acts as the memory of the network, carrying information throughout the processing of the sequence. Figure 2.4 illustrates how the RNN captures complicated temporal connections in data, making it a great tool for sequential tasks.
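In its simplest (vanilla) form, this recurrent update can be written as follows (a standard parameterization; the cited sources do not fix a specific one):

\[
h_t = \tanh\!\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right)
\]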
Recurrent neural networks are trained using backpropagation through time (BPTT), an adaptation of the traditional backpropagation algorithm. BPTT modifies the traditional technique by unfolding the network through time and adjusting weights based on the error computed at each time step (Poudel 2023).
During training, RNNs sometimes encounter vanishing or exploding gradients, resulting in gradients being too small to be helpful or too big and unstable (Poudel 2023). These issues impact the network’s capacity to learn long-term dependencies, which can significantly affect performance.
Advances in recurrent neural networks have greatly increased their ability to represent complex temporal patterns and make predictions. BN-RNNs, introduced by (Laurent et al. 2015), leverage batch normalization along the time axis to enhance training stability and speed up network convergence. This strategy investigates and uses interrelated patterns across different sequences, frequently producing more accurate forecasts than established methods such as ARIMA. LSTMs, with their unique gating mechanisms, are specifically designed to meet the difficulty of learning long-range dependencies, thus boosting the model’s ability to deal with complex time series data (Salinas, Flunkert and Gasthaus 2019).
2.3 Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) networks are a special kind of recurrent neural network (RNN) well known for their ability to capture long-term dependencies within time series data. LSTMs are distinguished from standard RNNs by a unique characteristic: the cell state. The cell state improves memory retention over longer sequences in LSTMs. It is particularly crucial for addressing the vanishing gradient issue encountered by RNNs, which causes them to lose information when handling long sequences (Olah 2015). Figure 2.5 shows the architecture of an LSTM, which consists of three separate gates.
Each LSTM unit is built around a cell, which serves as the central component for processing information. The flow within the cell is controlled by three distinct gates: the forget gate decides what prior information to discard, the input gate chooses new data to update the cell state, and the output gate determines the information to be carried forward. These gates control the amount of influence on the cell state by using a sigmoid function and produce a number between zero and one. A value of zero means entirely blocking that piece of information, whereas a value of one signals its full integration into the state (Olah 2015).
Figure 2.5: A basic LSTM architecture (Ingolfsson 2021).
An LSTM network works in the following way: the forget gate (ft) first determines what information from the previous state (Ct−1) should be discarded, using inputs from the prior hidden state (ht−1) and the current input (xt). The input gate (it) then evaluates which new information is significant to be stored in the cell state. This is followed by the creation of a candidate vector (C̃t) via a tanh layer. The cell state is then updated to a new state (Ct) by amalgamating the retained information from the previous state and the new candidate values. In the end, the output gate (ot) decides on the portions of the cell state that should be passed through as the output. After processing the cell state with a tanh function, the output is filtered by the sigmoid gate, yielding the final output hidden state (ht) (Olah 2015).
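For reference, the gate computations described above can be written compactly as follows (the standard formulation following Olah 2015, with σ denoting the sigmoid function and ⊙ element-wise multiplication):

\[
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right), &\quad
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right), \\
\tilde{C}_t &= \tanh\!\left(W_C\,[h_{t-1}, x_t] + b_C\right), &\quad
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right), &\quad
h_t &= o_t \odot \tanh\!\left(C_t\right).
\end{aligned}
\]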
LSTMs have become significant in time series forecasting. Work in this domain includes the Bidirectional Long Short-Term Memory (BiLSTM) architecture discussed in “Short-Term Load Forecasting Based on Deep Learning Bidirectional LSTM Neural Network” (Cai et al. 2021). BiLSTM expands upon traditional LSTMs by processing sequences in both forward and backward directions, a feature that enriches the model with a more comprehensive context, thus significantly benefiting short-term load forecasting in the energy sector. The model is made up of a deep, multi-layered BiLSTM structure that discovers patterns in historical load data, complemented by a feedback mechanism designed to manage the temporal dependencies in electricity usage. Moreover, ConvLSTM (Shi et al. 2015) presents an evolution in LSTM design by using convolutional networks to capture spatio-temporal dynamics. This hybrid architecture refines the task of precipitation forecasting by utilising both spatial and temporal information, enhancing the precision of rainfall prediction in localised areas over brief periods.
2.4 Transformers for Time Series
The Transformer architecture relies heavily on the attention mechanism, enabling the model to capture long-range dependencies among input elements. Transformers have achieved state-of-the-art results across a wide range of natural language processing tasks (Vaswani et al. 2017; Devlin et al. 2019) and have shown good performance in visual recognition (Dosovitskiy et al. 2021). The self-attention mechanism calculates query (Q), key (K), and value (V) representations from the input data and then combines them to produce the output (Vaswani et al. 2017). Multi-head attention advances self-attention by incorporating multiple self-attention layers operating in parallel, with the outputs combined linearly to produce the final multi-head attention output (Vaswani et al. 2017).
Figure 2.6: Self-attention mechanism (left) and multi-head attention mechanism (right) (Vaswani et al. 2017).
2.4.1 Self-attention
The self-attention mechanism computes dot products between queries (Q) and keys (K), scales them by the square root of the key dimension dk, normalizes them with a softmax to assign weights to the values (V), and computes the output matrix (as shown in Eq. 2.1).
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \tag{2.1}
\]
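The computation in Eq. 2.1 is straightforward to implement; the following is a minimal PyTorch sketch (illustrative only, with our own function and variable names):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Eq. (2.1): softmax(Q K^T / sqrt(d_k)) V for tensors of shape (batch, seq, d)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)             # attention weights per query
    return weights @ v                                  # (batch, seq, d)

q = k = v = torch.randn(2, 50, 64)   # e.g. 50 timestamps, 64-dimensional features
out = scaled_dot_product_attention(q, k, v)
```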
2.4.2 Multi-head attention
Multi-head attention divides self-attention into several heads h, each with its own set of learned weight matrices W_i^Q, W_i^K, W_i^V, as shown in Eq. 2.2. The attentive outputs of the individual heads are concatenated and passed through a final linear transformation with the weight matrix W^O, merging them into a single comprehensive representation. Thus, multi-head attention enables a model to focus on diverse parts of the input sequence at the same time, resulting in improved representation learning and the capture of intricate patterns.

\[
\begin{aligned}
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O} \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right)
\end{aligned} \tag{2.2}
\]
2.4.3 Transformer Architecture
The Transformer architecture consists of two components: an encoder and a decoder, which are stacks of N identical layers (Vaswani et al. 2017). Figure 2.7 shows the overall transformer architecture.
Encoder: The encoder layers convert input sequences into contextualized representations. Each encoder layer consists of two sub-layers: a multi-head attention mechanism and a position-wise feed-forward network. To improve learning, residual connections followed by layer normalization (J. L. Ba, Kiros and Hinton 2016) are used after each sub-layer (Vaswani et al. 2017).
Decoder: The decoder produces the final output sequence by interpreting abstract representations from the encoder. It generates the sequence in an autoregressive fashion by using a masked self-attention technique to guarantee that the predictions for every token are conditioned only on tokens that come before it. In addition, it has a cross-attention sub-layer that incorporates the encoder’s context, enhancing the prediction process with detailed sequence information. Each layer of the decoder is followed by a position-wise, fully connected feed-forward network. Each sub-layer also has residual connections and layer normalisation (J. L. Ba, Kiros and Hinton 2016). The decoder outputs a sequence of tokens that are projected through a linear layer and probability-distributed using a softmax function, capitalising on the nuanced context and sequential data that has been encoded (Vaswani et al. 2017).
Figure 2.7: Transformer architecture (Vaswani et al. 2017).
Positional encoding plays a critical role in Transformer models by giving them the ability to acknowledge the sequential arrangement of input tokens. This technique introduces information about the positions of tokens directly into their embeddings, maintaining the order of the sequence, which is not naturally captured by the initial embedding process. The design uses sinusoidal functions for positional encoding to embed position-dependent signals in token representations. The mathematical expression for the positional encoding of a token located at position pos and dimension i is given in Equation 2.3:
\[
\begin{aligned}
PE_{(pos,\,2i)} &= \sin\!\left(pos / 10000^{2i/d_{\mathrm{model}}}\right) \\
PE_{(pos,\,2i+1)} &= \cos\!\left(pos / 10000^{2i/d_{\mathrm{model}}}\right)
\end{aligned} \tag{2.3}
\]
Here, dmodel represents the dimensionality of the token embeddings. The choice of sinusoidal functions anchors the encoding in a geometric progression, allowing the model to differentiate relative and absolute token positions
across the sequence. By enabling positional encodings for token pos + k to be represented as a linear function for token pos, Transformer can encode and leverage sequential relationships among the input tokens (Vaswani et al. 2017).
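A short PyTorch sketch of Eq. 2.3 (illustrative; the function name and sizes are our own) is given below; the resulting matrix is simply added to the token embeddings:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Eq. (2.3): sine on even dimensions, cosine on odd dimensions."""
    position = torch.arange(seq_len).unsqueeze(1).float()               # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))              # 1 / 10000^(2i/d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=64)
```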
In the past few years, transformers have made significant achievements in time series. They are capable of handling complex temporal relationships and managing variable-length sequences by employing self-attention. Time series forecasting models, like LogTrans (S. Li et al. 2019) and PatchTST (Nie et al. 2023), make architectural modifications for efficiency and interpretability. TARnet (Chowdhury et al. 2022) uses techniques to improve model performance for classification tasks. For anomaly detection, TranAD (Tuli, Casale and Jennings 2022) leverages a dual transformer architecture to identify subtle discrepancies. Overall, these methods show the flexibility of transformers in handling intricacies within time series data.
2.5 Data Augmentation
Several challenges prevent predictive models from being trained effectively in time series modelling. Among these hurdles are privacy concerns and the constrained availability of large, diversified datasets, as highlighted by (Iglesias et al. 2022). Given the scarcity and sensitivity of time series data, data augmentation stands out as a strategic approach for synthesizing new patterns. The main goal of data augmentation is to prevent overfitting and improve the generalisation of models trained on limited datasets (Iwana and Uchida 2021).
While conventional data augmentation techniques, such as cropping and colour transformations, have proven effective in the domain of computer vision (CV), their direct applicability to time series data is inherently limited. Due to the sequential and temporal nature of time series data, using naive approaches may disturb the natural temporal sequences. It is vital to investigate augmentation techniques for temporal data that respect and preserve these essential sequential patterns (Yue et al. 2022; Iwana and Uchida 2021). The subsequent sections look into several data augmentation techniques designed for time series representation.
2.5.1 Traditional Data Augmentation
Transformation-based data augmentation strategies modify the original time series data to create new data samples (Iwana and Uchida 2021). The data augmentation of a single time series sample is illustrated in Figure 2.8.
Jittering
Jittering introduces Gaussian noise into the input time series. This method introduces subtle fluctuations yet ordinarily preserves the core structure of the data sequence. Adding jittering often improves the generalization of neural networks and their performance in downstream tasks. It acts as a countermeasure against time series drift, which arises due to varying data distributions (Iwana and Uchida 2021; Nikitin 2024).
Scaling
Scaling time series data involves adjusting the magnitude of the data by multiplying each sample with a random scalar value. This scalar is typically drawn from a Gaussian distribution, which enables controlled fluctuations in magnitude without affecting the length of the sequence (Iwana and Uchida 2021).
Permutation
Permutation introduces new patterns to a time series by changing the sequence of its segments. To generate permutations, a time series is divided into N uniform segments, which are then randomly shuffled. This technique should be used with caution as it may break the fundamental temporal dependencies of time series data (Iwana and Uchida 2021).
Magnitude warping
Magnitude warping alters the amplitude of a time series by multiplying its data points with values obtained from a smoothly varying curve, such as a cubic spline function. This approach transforms the magnitude of each sample in the dataset, resulting in unexpected yet realistic data variations (Iwana and Uchida 2021; Nikitin 2024).
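The transformation-based augmentations above are simple to implement; the following NumPy sketch illustrates jittering, scaling, and permutation (parameter values are illustrative defaults, not settings used in this thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(x, sigma=0.03):
    """Add Gaussian noise to every timestamp (the shape of x is preserved)."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def scaling(x, sigma=0.1):
    """Multiply each feature channel by a random factor drawn around 1."""
    factors = rng.normal(1.0, sigma, size=(1, x.shape[1]))
    return x * factors

def permutation(x, n_segments=4):
    """Split the series into segments and shuffle their order (use with caution)."""
    segments = np.array_split(x, n_segments, axis=0)
    order = rng.permutation(len(segments))
    return np.concatenate([segments[i] for i in order], axis=0)

x = rng.standard_normal((128, 3))                 # toy series: 128 timestamps, 3 features
augmented = [jitter(x), scaling(x), permutation(x)]
```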
2.5.2 Advanced Data Augmentation Techniques
Expanding on traditional augmentation methods, advanced data augmentation techniques utilize generative models like generative adversarial networks
Figure 2.8: An example of augmentations applied on input data to generate new samples (Eldele, Ragab, Z. Chen, Wu, C.-K. Kwoh et al. 2023).
(GANs) and variational autoencoders (VAEs) to enhance the augmentation process. These methods create synthetic examples that not only increase dataset size but also add detailed complexity, which better mirrors the diversity seen in the original data. Overall, generative models offer a powerful way to enhance dataset diversity and improve the robustness of machine learning models.
Generative Adversarial Networks (GANs)
Generative Adversarial Networks (GANs) are a family of generative models, first introduced by Ian Goodfellow (Goodfellow et al. 2014) in 2014. The framework is composed of two competing neural networks: a generator (G) and a discriminator (D). The generator G takes as input a random noise vector z drawn from an r-dimensional real space, z ∈ R^r. The goal of G is to produce synthetic data that mimics the distribution of the original data. In contrast to G, the discriminator D determines whether a given sample is real or was generated by G.
The generator G is designed to maximize the error rate of the discriminator D by producing convincing synthetic data. On the other hand, the discriminator D is optimized to minimize this error rate, boosting its capacity to classify data as real or fake. Figure 2.9 depicts a simplified version of the GAN model, where G and D engage in a two-player minimax game governed by the value function V (G,D). Within this framework, D(x) denotes the probability that a given data point x is sampled from the real data distribution rather than being produced by generator G. Equation 2.4 depicts the adversarial process in which G attempts to minimise the value function, forcing D to make classification errors, while D optimises its predictive accuracy by maximising the same value function (Brophy et al. 2023).
Figure 2.9: GAN architecture (Brophy et al. 2023).
\[
\min_{G} \max_{D} V(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big] \tag{2.4}
\]
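In practice, Eq. 2.4 is optimized by alternating two gradient steps. The sketch below shows one common PyTorch formulation, assuming a discriminator D that outputs probabilities; it uses the widely adopted non-saturating generator objective rather than the literal minimax form:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real, fake):
    """Discriminator side of Eq. (2.4): push D(x) towards 1 and D(G(z)) towards 0."""
    d_real = D(real)
    d_fake = D(fake.detach())          # detach so no generator gradients flow in this step
    return (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))

def generator_loss(D, fake):
    """Non-saturating generator objective: push D(G(z)) towards 1."""
    d_fake = D(fake)
    return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
```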
Some of the challenges faced by GANs include vanishing gradients, which stall the learning process when the discriminator becomes too good, and mode collapse, which leads to the generator producing very similar samples. To address mode collapse and vanishing gradients, the Wasserstein GAN (WGAN) framework was introduced, which uses the Wasserstein distance to encourage a better flow of gradients (Arjovsky, Chintala and Bottou 2017). Furthermore, utilizing the Kullback-Leibler divergence (Zhengwei Wang, She and Ward 2021) aids in assessing differences between data distributions with greater precision, enhancing GAN training and promoting diverse data generation (Brophy et al. 2023).
Generative Adversarial Networks (GANs) have expanded their functionality beyond their original purpose in computer vision, showing versatility in handling natural language and producing audio. Their use in time series analysis includes removing noise from corrupted signals, reducing privacy risks, and improving datasets through the creation of synthetic data (Brophy et al. 2023). The following are a few GAN architectures applicable to time series data.
• TS-GAN (Z. Yang, Y. Li and Zhou 2023): The TS-GAN model is a Generative Adversarial Network with LSTM-based architectures designed for augmenting time-series health data from sensors. It features an LSTM generator for creating realistic synthetic data and an LSTM discriminator enhanced with a Sequential-Squeeze-and-Excitation module for better feature discrimination, utilizing gradient penalty for stable
training. The goal is to generate high-fidelity data to improve the performance of deep learning-based classification models in healthcare.
• TimeGAN (Yoon, Jarrett and Van der Schaar 2019): TimeGAN is composed of the typical GAN generator and discriminator components plus two additional networks: an encoder and a recovery network. In order to preserve the features of the sequence, the encoder embeds the time series input into a latent space, capturing temporal dependencies. The recovery network then reconstructs the input data from this latent representation. The generator uses the latent embeddings to produce synthetic time series that the discriminator then attempts to distinguish from real sequences. A supervised loss function explicitly ensures that the generator learns the sequence of temporal transitions in the data, allowing TimeGAN to create time series that are not only plausible but also temporally consistent with the original data.
• Conditional Sig-Wasserstein GAN: The Conditional Signature Wasserstein GAN (Sig-WGAN) (Liao et al. 2023) improves data augmentation for time series data with long temporal dependencies. It introduces a novel Sig-Wasserstein metric for the discriminator, which differentiates between real and synthetic data. It eliminates the need for a complicated discriminator network and simplifies the training process. Sig-WGAN successfully captures the complexities of time series data by employing an autoregressive feedforward neural network (AR-FNN) as the generator. It outperforms TimeGAN and has also shown promising results in predicting stock market prices and volatility.
Chapter 3
Related Work
In this chapter, we examine the various approaches that have influenced time series analysis over the years. We begin by introducing conventional methods for capturing temporal patterns. In the second half of the chapter, we will look at recent deep learning approaches that have increased representation learning capabilities. This thesis identifies the obstacles inherent in time series analysis, such as high dimensionality, changing sampling intervals, and noise. Furthermore, we provide some of the existing literature relevant to our study objectives, focusing on how deep learning networks extract latent feature representations from time series data.
3.1 Classical Methods for Learning Temporal Dynamics in Time Series
The traditional methods for time series analysis provide the framework for understanding and predicting temporal processes. We examine important approaches, their consequences, and how they helped to develop contemporary data science methods.
3.1.1 Fourier Transforms
The Fourier Transform decomposes a time series into its frequency components, enabling the detection of periodicity, hidden patterns, and trends. This approach is especially useful for examining signals with repetitive patterns. In particular, the Discrete Fourier Transform (DFT) and Fast Fourier Transform (FFT) algorithms can efficiently process large-scale time series datasets (Shkulov 2023).
3.1.2 ARIMA
In time series forecasting, the AutoRegressive Integrated Moving Average (ARIMA) model is a well-established technique, first presented by Box and Jenkins in 1970 (Box et al. 2015). In an ARIMA model, future values of a variable are a linear combination of past observations coupled with random errors (Ariyo, Adewumi and Ayo 2014). It can be written as:
\[
y_t = \phi_0 + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \cdots - \theta_q \varepsilon_{t-q} \tag{3.1}
\]
Each parameter in the ARIMA(p, d, q) model, where p is the number of autoregressive terms, d the differencing order, and q the number of moving average terms, contributes to the model’s ability to fit a wide spectrum of time series data.
Due to their versatility and robustness in modelling a wide range of time series behaviours, ARIMA models have been widely employed in sectors such as economics and finance for forecasting stock prices and economic indicators. In (G. Zhang 2003), a hybrid approach exploits the properties of both ARIMA and ANN models to capture the complex autocorrelation structures in the data. The ARIMA model has also achieved strong performance in short-term stock price prediction (Ariyo, Adewumi and Ayo 2014).
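As a brief usage illustration (assuming the statsmodels library; the toy series and model order are arbitrary), fitting an ARIMA(p, d, q) model and producing forecasts looks like this:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

y = np.cumsum(np.random.default_rng(0).normal(size=200))   # toy non-stationary series
model = ARIMA(y, order=(2, 1, 1))     # p=2 AR terms, d=1 difference, q=1 MA term
fitted = model.fit()
forecast = fitted.forecast(steps=10)  # 10-step-ahead prediction
```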
3.1.3 Dynamic Time Warping (DTW)
Dynamic Time Warping (DTW) is known for its ability to evaluate the optimal correspondence between two time-dependent sequences. By performing non-linear mappings, DTW proficiently aligns these sequences, a process that allows for accurate comparison even when there are variations in timing or speed between them (“Dynamic Time Warping” 2007). With applications ranging from sequence classification (“Flexible Dynamic Time Warping for Time Series Classification” 2015) to economic forecasting (L. Wang and Koniusz 2022) and the identification of anomalies within time series data (Duy and Takeuchi 2023), this technique has established itself as an essential tool for temporal sequence comparison.
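The optimal alignment cost can be computed with a simple dynamic program; the sketch below is the textbook O(nm) formulation for two univariate sequences (illustrative, without the windowing constraints often used in practice):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-programming DTW between two 1D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])                  # local distance
            # extend the cheapest of the three allowed alignment moves
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 0, 1, 2, 3, 2, 1]))   # small warp -> small cost
```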
In the following sections, we divide time series representation learning methods into three categories.
3.2 Supervised Learning
Supervised learning is the process of training models on datasets using predefined labels to identify relationships between input variables and the target labels. For example, a linear regression model predicts housing prices by considering attributes such as size and location, whereas a neural network can differentiate between images of handwritten digits. The models adjust their parameters through error minimization to apply learned patterns to new data (Y. Wang, Z. Cui and Ke 2023).
Supervised learning focuses on training an encoder fe that parameterizes features, mathematically expressed as fe : R^(T×V) → R^(R×F). This encoder aims to transform the raw input data into a refined feature space, simplifying the task of classification or regression. This process reduces the necessity of manual feature engineering and allows the model to adjust to complex data patterns. Recent developments in supervised learning have adopted innovative loss and objective functions that improve the handling of incomplete time series and process diverse data types, such as visual and auditory signals. However, the scope of supervised learning in the context of universal representation is limited due to the scarcity of extensive labelled datasets and the generalizability constraints of current models (Trirat et al. 2024).
The limitations of supervised learning have sparked greater interest in unsupervised and self-supervised learning methods. These approaches generate their own labels or rely on pretext tasks, reducing the need for large, manually labelled datasets. This paradigm shift expands the prospects for applying representation learning to a wider range of fields (Trirat et al. 2024).
3.3 Unsupervised Learning
Unsupervised learning is distinguished from supervised learning in that it trains an encoder fe to extract features from an unlabelled dataset D = {X_i}_{i=1}^N. This technique depends on unsupervised tasks such as data reconstruction to adjust fe, and the encoder’s performance is evaluated by how well it can reproduce the data using an unsupervised loss function. One primary benefit of unsupervised learning is that it is not dependent on annotated datasets, which makes it a flexible and cost-effective solution, especially when labelled data is inaccessible or expensive (Trirat et al. 2024).
Reconstruction-based techniques are widely used in unsupervised learning and make use of algorithms such as autoencoders or sequence-to-sequence models. These methods aim to reconstruct full sequences or sequence segments from raw time series data by training in conjunction with a decoding module (Trirat et al. 2024). Highlighted below are a few notable methods for unsupervised representation learning:
• TimeNet (Malhotra et al. 2017): By employing a multi-layer recurrent neural network, TimeNet uses a sequence-to-sequence framework with an encoder-decoder architecture to reconstruct time series data. The RNN encoder processes variable-length sequences, converting them into stable, fixed-size vector representations, which are subsequently used by the decoder to produce output sequences. The RNN encoder acts as a feature extractor, providing reliable embedding vectors crucial for a wide range of tasks such as document classification and time series analysis.
• Ti-MAE (Z. Li et al. 2023): Ti-MAE applies random masking to time series embeddings to achieve targeted reconstructions with fine-grained accuracy using an autoencoder architecture. Adjusting the masking ratio allows it to effectively handle various prediction scenarios, improving accuracy in forecasting tasks and handling distribution shifts. This approach of varying masking ratios also differentiates it from other transformer-based models (Z. Li et al. 2023).
• SimMTM (Dong et al. 2023): SimMTM presents a masked modelling approach that synthesizes information across related but masked sequences to optimise the reconstruction of masked periods. By emphasizing the local manifold structure, it improves the understanding of time dynamics beyond basic reconstruction from complete data. In order to strengthen the integrity of the extracted features, the model is further improved using a constraint loss function that encourages the consistency of learned representations with the neighbourhood structure of the manifold (Dong et al. 2023).
3.4 Self-supervised Learning
Self-supervised representation learning relies on unlabelled data for training. It employs self-supervised signals, known as pseudo labels, to train the loss function on a dataset D = {(X_i, ŷ_i)}_{i=1}^N, rather than relying on manually provided labels (Trirat et al. 2024). This approach is especially appealing given the
high expenses of manual labelling. Below are a few self-supervised learning frameworks:
• TST (Zerveas et al. 2021): The TST framework uses a transformer encoder to extract dense vector representations of multivariate time series, which is achieved by employing an input denoising objective. It differs from traditional denoising autoencoders, which are designed to reconstruct the entire input under Gaussian noise. Instead, it masks out a few elements that the model predicts during unsupervised pre-training. The use of masking in TST helps the model understand past and future data points for each variable as well as the simultaneous values of other variables, allowing it to understand complex relationships in the time series (Zerveas et al. 2021).
• wav2vec 2.0 (Baevski et al. 2020): The wav2vec 2.0 framework consists of a multi-layer convolutional feature encoder which processes raw audio input to extract latent speech representations. These representations are then fed into a context network composed of Transformer layers, designed to contextualize the features by modeling the temporal relationships within the speech (Baevski et al. 2020).
3.4.1 Contrastive Learning
Contrastive learning is a machine learning technique that compares ‘positive’ and ‘negative’ instance pairs to derive meaningful representations. The guiding principle is that similar instances should come closer in the embedding space, while dissimilar ones should be pushed apart (Buhlurl n.d.). An illustration of how positive and negative pairs are placed in the latent space post-augmentation can be seen in Figure 3.1.
Self-supervised contrastive learning (SSCL) follows this concept by combining unlabelled data with pretext tasks. These tasks often involve generating augmented duplicates of data points, creating different instances for the model to compare and analyze. This method enables SSCL to capture complicated semantic information. A notable method in this domain is SimCLR, which employs a contrastive loss to maximize the agreement between augmented versions of the same data point while pushing apart the representations of different ones, enabling the extraction of useful representations (T. Chen et al. 2020). SimCLR further refines learning by inserting a trainable non-linear transformation between the extracted features and the contrastive loss for enhanced results (T. Chen et al. 2020).
Figure 3.1: Self-supervised contrastive learning (Witter 2023).
Several contrastive learning models use specialized loss functions, like triplet loss (Franceschi, Dieuleveut and Jaggi 2019) or contrastive loss (T. Chen et al. 2020; Tonekaboni, Eytan and Goldenberg 2021; Yue et al. 2022), to fine-tune the embedding space. The following items discuss advanced contrastive learning architectures for time series data:
• TNC (Tonekaboni, Eytan and Goldenberg 2021): TNC extracts representations from multivariate, non-stationary time series by using the local consistency of the data. It identifies and exploits temporal neighbourhood segments with consistent statistical characteristics and ensures that signals within these neighbourhoods are closely aligned in representation space while those from distinct neighbourhoods are separated. This debiased contrastive approach precisely captures changes in temporal dynamics. The main advantage of this framework lies in its unsupervised nature, allowing structured data analysis and minimizing biases through Positive Unlabelled Learning in contrastive loss calculation.
• T-Loss (Franceschi, Dieuleveut and Jaggi 2019): T-Loss presents an unsupervised approach for crafting general-purpose representations of multivariate time series, with an emphasis on accommodating their varying and potentially great lengths. This method exploits an encoder with causal dilated convolutions, coupled with an unsupervised triplet loss mechanism that incorporates time-based negative sampling. A random sub-series is chosen as an anchor during training, and the sub-series overlapping with the anchor is considered positive due to
their temporal proximity. On the contrary, a sub-series from a different time series is labelled as negative. T-Loss aims to achieve robust representations by utilizing the encoder’s ability to handle time series of different lengths.
• TS-TCC (Eldele, Ragab, Z. Chen, Wu, C. K. Kwoh et al. 2021): Time-Series Representation Learning via Temporal and Contextual Contrasting (TS-TCC) is a novel unsupervised framework that leverages temporal and contextual contrasting for time-series representation learning. Initially, the framework applies weak and strong augmentations to raw time-series data to craft two correlated views. Following this, a temporal contrasting module undertakes a cross-view prediction task, enforcing the model to learn robust features that are immune to various perturbations. Then, the contextual contrasting module uses these latent features to distinguish contexts within the same sample from those across different samples, thereby refining the representations. Overall, TS-TCC proposes a unique contrastive learning approach that ensures acquiring strong, discriminative features in an unsupervised manner for time-series data.
• TS2Vec (Yue et al. 2022): TS2Vec improves on prior state-of-the-art methods by learning time series representations at multiple semantic levels. Unlike traditional methods, it uses hierarchical contrastive learning across augmented context views, ensuring robust contextual representations for each timestamp. Through the use of max pooling across timestamps, the model gains a full representation for a specific sub-series. This approach allows TS2Vec to capture contextual details at multiple resolutions within the data and results in highly detailed representations that can adapt to any level of granularity.
Chapter 4
Methodology
This chapter presents TS2Vec (Yue et al. 2022), which serves as the foundation for this thesis. Before delving into the proposed enhancements, i.e., Synth-TS2Vec, it is important to thoroughly review TS2Vec: Towards Universal Representation of Time Series.
4.1 TS2Vec
Current contrastive learning approaches for time series, such as TNC (Tonekaboni, Eytan and Goldenberg 2021) and TS-TCC (Eldele, Ragab, Z. Chen, Wu, C. K. Kwoh et al. 2021), seek to encapsulate overall series characteristics at the instance level. In contrast, methods like T-Loss (Franceschi, Dieuleveut and Jaggi 2019) at the temporal level aim to discern fine-grained patterns within specific time intervals. However, these methods often overlook multi-scale contextual details, thereby excluding scale-invariant aspects that are critical for robust generalization. Moreover, unsupervised time series representation techniques often adopt invariance assumptions from the fields of computer vision (CV) and natural language processing (NLP), which may not be suitable for the distinctive properties of time series data. Specifically, standard operations in other domains, such as cropping, can significantly alter both the distribution and semantics of time series data (Yue et al. 2022).
To address the above challenges, (Yue et al. 2022) introduced TS2Vec, designed for hierarchical representation learning across different semantic levels. By employing a hierarchical contrastive method, TS2Vec differentiates between positive and negative examples in both instance-wise and temporal dimensions. This method allows for detailed representations to be extracted for sub-series, improving contextual understanding at multiple resolutions.
Figure 4.1: Architecture of TS2Vec (Yue et al. 2022).
Moreover, it guarantees consistent representations of equivalent sub-series in the context of different augmentations. The framework of TS2Vec is shown in Fig 4.1. It can be seen that two randomly cropped segments of an input time series are fed into the encoder. The encoder fθ comprises three modules:
Input Projection Layer: This is a fully connected layer that transforms the input observation at timestamp t into a high-dimensional latent vector for each instance.
Timestamp Masking: This component creates augmented context views by masking latent vectors at random timestamps from the previous layer.
Dilated Convolutions: The dilated Convolutional Neural Network (CNN) module acts as a feature extractor, extracting contextual representations at each timestamp through a series of residual blocks. This enables the model to understand both short-term and long-term patterns (Yue et al. 2022).
The construction of positive pairs in contrastive learning enables a model to learn useful representations by training it to distinguish between similar (positive) and dissimilar (negative) examples. A key feature that sets TS2Vec apart from other models is the construction of its positive pairs. Prior studies have used approaches such as sub-sequence consistency (Franceschi,
Dieuleveut and Jaggi 2019), temporal consistency (Tonekaboni, Eytan and Goldenberg 2021), and transformation consistency (Eldele, Ragab, Z. Chen, Wu, C. K. Kwoh et al. 2021). However, the above methods rely on strong data distribution assumptions, which makes them unsuitable for time series data (Yue et al. 2022). The authors of TS2Vec presented a novel approach known as contextual consistency. This method considers representations at the same instant in two augmented contexts as positive pairs. A context is created by applying timestamp masking and random cropping to the input series during training while keeping the original timestamps intact. This approach preserves the magnitude and enhances the robustness of representations by enforcing timestamp reconstruction across different contexts (Yue et al. 2022). The proposed contextual consistency for positive pair selection is shown in Fig 4.2.
Figure 4.2: Strategies for selection of positive pairs (Yue et al. 2022).
The hierarchical contrasting framework utilizes two losses to obtain the contextual representations at multiple scales within a given time series. The Temporal Contrastive Loss considers the representations at the same timestamp from augmented contexts as positives, whereas the representations of samples within the same time series at different timestamps are considered as negatives (Yue et al. 2022). The Temporal Contrastive Loss is described as:

\[
\ell_{\mathrm{temp}}^{(i,t)} = -\log \frac{\exp\left(r_{i,t} \cdot r'_{i,t}\right)}{\sum_{t' \in \Omega} \left( \exp\left(r_{i,t} \cdot r'_{i,t'}\right) + \mathbb{1}_{[t \neq t']} \exp\left(r_{i,t} \cdot r_{i,t'}\right) \right)}, \tag{4.1}
\]
where i is the index of the input sample xi, and t is the timestamp. ri,t and r′i,t are the augmented representations for a given timestamp t, and Ω represents the set of overlapped augmented timestamps. The Instance-wise Contrastive Loss, on the other hand, is defined as:

\[
\ell_{\mathrm{inst}}^{(i,t)} = -\log \frac{\exp\left(r_{i,t} \cdot r'_{i,t}\right)}{\sum_{j=1}^{B} \left( \exp\left(r_{i,t} \cdot r'_{j,t}\right) + \mathbb{1}_{[i \neq j]} \exp\left(r_{i,t} \cdot r_{j,t}\right) \right)}, \tag{4.2}
\]
where B is the batch size. This loss uses the same augmented positive pairs, but treats the representations of other time series in the same batch at timestamp t as negative samples. The overall loss is formulated as:

\[
\mathcal{L}_{\mathrm{dual}} = \frac{1}{NT} \sum_{i} \sum_{t} \left( \ell_{\mathrm{temp}}^{(i,t)} + \ell_{\mathrm{inst}}^{(i,t)} \right) \tag{4.3}
\]
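For concreteness, the following minimal PyTorch sketch shows one way Equations 4.1–4.3 could be computed for a batch of representations. It is written for readability rather than numerical stability, and the tensor names and shapes are illustrative assumptions, not the reference implementation of TS2Vec.

import torch

def temporal_contrastive_loss(r, r_prime):
    # Eq. (4.1): positives are the same timestamp in the two augmented views.
    # r, r_prime: (B, T, C)
    B, T, _ = r.shape
    cross = torch.einsum('btc,bsc->bts', r, r_prime)   # r_{i,t} . r'_{i,t'}
    intra = torch.einsum('btc,bsc->bts', r, r)         # r_{i,t} . r_{i,t'}
    pos = cross.diagonal(dim1=1, dim2=2)               # (B, T) positive scores
    mask = ~torch.eye(T, dtype=torch.bool, device=r.device)
    denom = cross.exp().sum(dim=2) + (intra.exp() * mask).sum(dim=2)
    return -(pos - denom.log()).mean()

def instance_contrastive_loss(r, r_prime):
    # Eq. (4.2): negatives are the other series of the batch at the same timestamp.
    B, T, _ = r.shape
    cross = torch.einsum('btc,jtc->tbj', r, r_prime)   # r_{i,t} . r'_{j,t}
    intra = torch.einsum('btc,jtc->tbj', r, r)         # r_{i,t} . r_{j,t}
    pos = cross.diagonal(dim1=1, dim2=2)               # (T, B) positive scores
    mask = ~torch.eye(B, dtype=torch.bool, device=r.device)
    denom = cross.exp().sum(dim=2) + (intra.exp() * mask).sum(dim=2)
    return -(pos - denom.log()).mean()

def dual_loss(r, r_prime):
    # Eq. (4.3): average of both losses over all samples and timestamps.
    return temporal_contrastive_loss(r, r_prime) + instance_contrastive_loss(r, r_prime)

r, r_p = torch.randn(8, 50, 320), torch.randn(8, 50, 320)
print(dual_loss(r, r_p))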
4.2 Problem Statement
Consider a set of time series X = {x1, x2, · · · , xN} where each time series instance xi has dimensions T × F , with T representing the sequence length and F denoting the feature dimension. To enhance feature extraction and capture dependencies across sequences, an attention mechanism Att(xi; θ) is integrated into the dilated CNN module. Concurrently, a Generative Adversarial Network (GAN) augments the dataset X with synthetic samples x̃i = G(zi; θ), where zi ∼ P (z), enriching X̃ = X ∪ {x̃1, x̃2, ..., x̃M} for improved model generalization. Following this, we obtain the representation vectors ri = {ri,1, ri,2, · · · , ri,T} for each timestamp t.
4.3 Proposed Method
The overall architecture of Synth-TS2Vec is illustrated in Figure 4.3. This structure presents a synthetic time series generation strategy for data augmentation that employs a Generative Adversarial Network (GAN) with Wasserstein loss and gradient penalty (WGAN-GP) to improve training dynamics. The framework consists of two essential components: the Generator (G) and the Discriminator, which is referred to as the Critic (D) in this method. Both components employ Long Short-Term Memory (LSTM) layers to capture the sequential dependencies present in time series data. We partially adapt the structure of our GAN from (Proceduralia 2019), which was developed for time series data generation. In our approach, the encoder fθ utilizes both original and synthetic time series data by using the pre-trained Critic, D. Training on the merged dataset aims to produce more reliable representations and thus improved accuracy in classification tasks. Therefore, including the Critic in the representation model helps it generalize better across different time series datasets. In the following subsections, we discuss the architecture in detail.
Figure 4.3: Overall architecture of Synth-TS2Vec. The model consists of two parts: (1) A Generative Adversarial Network to generate synthetic samples, (2) an encoder that learns representations of input time series instances through hierarchical contrastive loss.
4.3.1 Generator
The Generator G is composed of a single LSTM layer. It takes as input a noise vector z drawn from a pre-defined latent space Z with distribution pz(z); in particular, z is sampled from a Gaussian distribution, z ∼ N (0, I100). The LSTM layer processes z through its gates and memory cells, allowing it to retain information from previous steps. After the recurrent layer, the network employs a LeakyReLU activation, LeakyReLU(x) = max(αx, x), to maintain gradient flow and prevent inactive neurons. Finally, the LSTM output is fed to a dense layer that transforms it back into sequences mimicking the actual time series data.
The function G(z; θg) maps the noise vector to the data space, with θg denoting the parameters of G. The output sequence is given as

\[
\tilde{x}_i = G(z; \theta_g),
\]

where x̃i represents the synthesized time series data. The weights of G, denoted as Wg, are initialised via Xavier initialization (Glorot and Bengio 2010). This preserves the variance of the activations and of the backpropagated error signals, which helps avoid vanishing or exploding gradients during training.
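A minimal PyTorch sketch of such a generator, assuming the layer layout of Table A.1, is shown below; the argument names (latent_dim, hid_dim, n_features) and the LeakyReLU slope are illustrative assumptions rather than the exact thesis code.

import torch
import torch.nn as nn

class LSTMGenerator(nn.Module):
    """Sketch of an LSTM generator: noise sequence -> synthetic time series."""
    def __init__(self, latent_dim=100, hid_dim=256, n_features=1):
        super().__init__()
        self.lstm = nn.LSTM(latent_dim, hid_dim, batch_first=True)
        self.act = nn.LeakyReLU(0.2)
        self.out = nn.Linear(hid_dim, n_features)
        # Xavier initialization of the weight matrices (Glorot and Bengio 2010)
        for name, p in self.named_parameters():
            if 'weight' in name and p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, z):
        # z: (batch, seq_len, latent_dim), sampled from N(0, I)
        h, _ = self.lstm(z)
        return self.out(self.act(h))          # (batch, seq_len, n_features)

# usage: generate a batch of 16 synthetic series of length 128
z = torch.randn(16, 128, 100)
x_tilde = LSTMGenerator()(z)
print(x_tilde.shape)                          # torch.Size([16, 128, 1])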
4.3.2 Critic (Discriminator)
The architecture of the Critic is aligned with the design of the LSTM-based Generator in order to address the sequential nature of the time series data. Like the Generator, the Critic employs an LSTM layer, which allows for effective processing of the temporal dependencies present in the data. The network concludes with a fully connected output layer. Instead of returning a probability, the Critic outputs a real-valued score D(xi; θd), where θd are the learned parameters of the Critic. This enables the Critic to serve its purpose within the WGAN-GP framework (Gulrajani et al. 2017): given an input time series sequence xi, it returns a non-probabilistic score D(xi; θd) that measures how realistic xi appears without classifying instances as real or fake. These scores are used to estimate the Wasserstein distance during training and thus provide more stable, meaningful updates that reflect the differences between the distributions of real and generated data. The detailed architectural specifications for both networks are provided in Appendix Section A.1.
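Under the same assumptions, a matching critic could be sketched as follows; the mean over the sequence dimension mirrors the aggregation row of Table A.2 and is an illustrative choice.

import torch
import torch.nn as nn

class LSTMCritic(nn.Module):
    """Sketch of an LSTM critic: time series -> real-valued Wasserstein score."""
    def __init__(self, n_features=1, hid_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hid_dim, batch_first=True)
        self.act = nn.LeakyReLU(0.2)
        self.out = nn.Linear(hid_dim, 1)

    def forward(self, x):
        # x: (batch, seq_len, n_features)
        h, _ = self.lstm(x)
        scores = self.out(self.act(h))        # (batch, seq_len, 1) per-step scores
        return scores.mean(dim=1)             # (batch, 1) unbounded score, no sigmoid

x = torch.randn(16, 128, 1)
print(LSTMCritic()(x).shape)                  # torch.Size([16, 1])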
4.3.3 Training
Traditional GAN training follows the idea of a two-player minimax game, where the generator and discriminator have competitive goals, as defined in the following function by (Goodfellow et al. 2014):
\[
V(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{4.4}
\]
However, in this research, we use the Wasserstein GAN with Gradient Penalty (WGAN-GP) (Gulrajani et al. 2017), an improved variant of the original GAN. WGAN-GP addresses the limitations of a standard WGAN, which employs weight clipping to enforce a Lipschitz constraint, resulting in suboptimal sample quality (Gulrajani et al. 2017).
Algorithm 1 Training Procedure for LSTM-based WGAN-GP
Require: nepochs = 50, batch size = 16, learning rate α = 0.001, gradient penalty coefficient λ = 10, critic updates per step ncritic = 5
Ensure: Trained generator G and critic D
1: Initialize G and D with Xavier initialization
2: Initialize Adam optimizers for G and D
3: for epoch = 1, 2, . . . , nepochs do
4:   for each batch of training data x ∼ Pr do
5:     for t = 1 to ncritic do
6:       Sample noise z ∼ Pz
7:       Generate fake data: x̃ = G(z)
8:       Compute Wasserstein loss and gradient penalty for D
9:       Update D weights using Adam optimizer
10:    end for
11:    Sample new noise z ∼ Pz for the generator update
12:    Generate fake data: x̃ = G(z)
13:    Compute generator loss for G
14:    Update G weights using Adam optimizer
15:  end for
16: end for
17: return trained models G and D
Prior to the adversarial training process, the training data undergoes a preprocessing phase to normalize the input data. We perform no hyperparameter tuning and follow the hyperparameter settings from (Gulrajani et al. 2017) with a few modifications. Following (Huang and Deng 2023), we use a batch size of 16 due to the significant limitations in sample sizes present in the UCR datasets.
WGAN-GP diverges from the WGAN objective by employing an alternative loss function for training the Critic (D), which includes a gradient penalty to ensure that the model satisfies the Lipschitz constraint:

\[
L_D = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right] \tag{4.5}
\]
The training progresses iteratively over numerous epochs. In each epoch, the Generator (G) seeks to minimize its loss function:

\[
L_G = -\mathbb{E}_{z \sim P_z}[D(G(z))] \tag{4.6}
\]
Concurrently, the Critic (D) is updated more frequently to refine its evaluation of the data. Algorithm 1 provides a full description of the training technique, inspired by (Gulrajani et al. 2017). During training, the Generator and Critic steadily improve. The Critic refines its ability to assess the generated data by optimizing its loss function LD, while the Generator tries to replicate real time series data more accurately, guided by the feedback from the Critic. After training the GAN, we use PCA to evaluate GAN-generated samples, reducing the high-dimensional data to two principal components for analysis.
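To make the gradient penalty of Equation 4.5 concrete, the sketch below shows one possible critic update step corresponding to lines 5–10 of Algorithm 1; the function and variable names (critic_step, lam, latent_dim) are illustrative assumptions rather than the exact training code.

import torch

def critic_step(critic, generator, x_real, opt_d, latent_dim=100, lam=10.0):
    """One WGAN-GP critic update, cf. Eq. (4.5); called n_critic times per generator step."""
    batch, seq_len, _ = x_real.shape
    z = torch.randn(batch, seq_len, latent_dim, device=x_real.device)
    x_fake = generator(z).detach()

    # interpolate between real and generated samples for the gradient penalty
    eps = torch.rand(batch, 1, 1, device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    d_hat = critic(x_hat)
    grad = torch.autograd.grad(outputs=d_hat.sum(), inputs=x_hat,
                               create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

    # Wasserstein critic loss plus penalty term
    loss_d = critic(x_fake).mean() - critic(x_real).mean() + lam * penalty
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    return loss_d.item()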
4.3.4 Representation Encoder
We reuse the pre-trained discriminator D to generate synthetic samples, which are mixed with the original data. At first, we preprocess both the synthetic and original data. After preprocessing, we randomly select two overlapping subseries from the combined time series input xi to promote consistency in their contextual representations within the overlapping segment. The process of random cropping is detailed in Appendix A.2. To extract meaningful representations from time series data that effectively capture both local and long-range dependencies, we modify our encoder fθ as follows: (1) an input projection layer; (2) dilated convolutions with self-attention; (3) hierarchical contrasting with adaptive pooling.
Input Projection Layer The input projection layer, presented as the first layer of the encoder in (Yue et al. 2022), functions as a fully connected layer. It maps each observation xi,t of the input time series at timestamp t to a high-dimensional feature vector zi,t. This transformation allows the model to capture the complex features of the time series data in a higher-dimensional space.
Dilated CNN with Self-Attention In contrast to TS2Vec, we integrate a self-attention mechanism into the dilated CNN module. This fusion attempts to capture dependencies within the input time series xi more effectively. Dilated convolutions are designed to achieve a larger receptive field without significantly increasing computational complexity (Bai, Kolter and Koltun 2018). These layers understand the positional order of the data through localized receptive fields and shared weights, enabling the network to identify long-range dependencies without employing positional encoding (Yue et al. 2022).
Synth-TS2Vec is composed of ten residual blocks, each designed to refine contextual features at each timestamp t. These blocks consist of two dilated 1-D convolutional layers. The dilation rate increases exponentially as 2l for the l-th block, which enhances the network's capacity to integrate information across extensive spans of the sequence. Each block concludes with layer normalization and a skip connection, promoting consistent information flow and feature resolution throughout the network's depth.
Each dilated convolutional layer uses a Gaussian Error Linear Unit (GELU) to add non-linearity while avoiding the dead-unit problem (zero gradients for negative inputs) associated with rectified linear units (ReLUs) (Hendrycks and Gimpel 2023).
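As an illustration, one such residual block could be sketched as follows; the exact ordering of activation, normalization and skip connection in the thesis code may differ, so this is only an assumed layout.

import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Two dilated 1-D convolutions with GELU, layer norm and a skip connection."""
    def __init__(self, channels=64, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2     # keep the sequence length
        self.conv1 = nn.Conv1d(channels, channels, kernel_size,
                               padding=pad, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size,
                               padding=pad, dilation=dilation)
        self.norm = nn.LayerNorm(channels)
        self.act = nn.GELU()

    def forward(self, x):
        # x: (batch, channels, T)
        h = self.conv2(self.act(self.conv1(self.act(x))))
        # skip connection followed by layer normalization over the channel dimension
        return self.norm((x + h).transpose(1, 2)).transpose(1, 2)

# stack of 10 blocks with exponentially increasing dilation 2^l
blocks = nn.Sequential(*[DilatedResidualBlock(dilation=2 ** l) for l in range(10)])
x = torch.randn(8, 64, 300)
print(blocks(x).shape)                              # torch.Size([8, 64, 300])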
We now describe how the model combines self-attention with dilated convolutions. It uses the output from the dilated convolutional layers to generate three distinct sets of representations: queries (Q), keys (K), and values (V ). Here, the queries represent the current elements to analyze, the keys provide a basis against which comparisons are made, and the values store the information that the attention mechanism is trying to capture.
The model uses a multi-head attention mechanism with four heads to process different features of the input sequence in parallel. Each head applies transformations to the queries (Q), keys (K), and values (V ). By combining the processed data from all heads, the model gains a comprehensive understanding of the sequence. For each data point, the model computes attention
scores by performing scaled dot products between queries and keys. According to (Vaswani et al. 2017), scaling down the scores improves the model's training stability and prevents the softmax gradients from getting too small. The normalized attention scores are computed using the softmax function, as shown in Equation 4.7, from (Vaswani et al. 2017):

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \tag{4.7}
\]
This mechanism enables the model to weight values based on these normalized probabilities, thus focusing on the most significant information. To prevent overfitting, we add a dropout rate of 0.1, which contributes to the model's enhanced ability to generalize. By combining dilated convolutions with multi-head attention, our model extracts a contextual representation at each timestamp t, thus enabling it to accurately capture complex temporal dependencies.
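The following fragment sketches how the convolutional features could be passed through a four-head self-attention layer with the scaled dot product of Equation 4.7; it relies on PyTorch's nn.MultiheadAttention rather than the thesis's own implementation and is therefore only illustrative.

import torch
import torch.nn as nn

channels, heads = 64, 4
attn = nn.MultiheadAttention(embed_dim=channels, num_heads=heads,
                             dropout=0.1, batch_first=True)

def attend(conv_features):
    # conv_features: (batch, T, channels), output of the dilated CNN
    # queries, keys and values are all derived from the same sequence (self-attention)
    out, _ = attn(conv_features, conv_features, conv_features)
    return out                                  # contextual representation per timestamp

x = torch.randn(8, 300, channels)
print(attend(x).shape)                          # torch.Size([8, 300, 64])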
4.3.5 Hierarchical Contrasting
Our contrastive learning framework learns representations of time series at different scales. It learns representations by discriminating between positive and negative samples, not just at the instance level but also across different temporal dimensions. This means that the model will be trained to understand the data across various timescales. The model uses the hierarchical contrastive loss introduced in Section 4.1.
The hierarchical contrastive loss in (Yue et al. 2022) uses max pooling with a fixed kernel size to compress feature representations along the temporal dimension at each hierarchical level. In our improved framework, we integrate adaptive pooling into the hierarchical contrastive loss calculation, which offers dynamic target-size adjustment and flexibility across different sequence lengths. Padding is still employed, as in the baseline, to ensure equal lengths, but adaptive pooling focuses on extracting the most relevant features, thus improving the quality of the temporal information. Specifically, adaptive pooling minimizes the negative impact of non-informative zero-padding and better preserves meaningful features across hierarchical levels. This results in reduced information loss, improved training stability, and more robust representation learning.
The instance-wise and temporal contrastive losses in the hierarchical framework are given in Section 4.1. Our method calculates the hierarchical contrastive loss as described in Algorithm 2. We also observe an improvement in classification accuracy (see Section 5.4), suggesting more effective information preservation and a potential improvement of the learning process.
Algorithm 2 Calculating the hierarchical contrastive loss
1: procedure HierLoss_Adaptive(r, r′)
2:   Lhier ← Ldual(r, r′);
3:   d ← 1;
4:   while time_length(r) > 1 do
5:     // The adaptive pool1d operates along the time axis
6:     Ttarget ← max(1, time_length(r) // 2);
7:     r ← adaptive_pool1d(r, output_size = Ttarget);
8:     r′ ← adaptive_pool1d(r′, output_size = Ttarget);
9:     Lhier ← Lhier + Ldual(r, r′);
10:    d ← d + 1;
11:  end while
12:  Lhier ← Lhier / d;
13:  return Lhier
14: end procedure
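A direct PyTorch transcription of this procedure might look as follows; dual_loss stands for the combined loss Ldual of Equation 4.3, and whether the adaptive pooling is the max or the average variant is an assumption in this sketch.

import torch.nn.functional as F

def hierarchical_loss_adaptive(r, r_prime, dual_loss):
    # r, r_prime: (B, T, C) representations of the two augmented contexts
    loss, depth = dual_loss(r, r_prime), 1
    while r.size(1) > 1:
        target = max(1, r.size(1) // 2)
        # adaptive pooling operates along the time axis, so move time to the last dim
        r = F.adaptive_max_pool1d(r.transpose(1, 2), target).transpose(1, 2)
        r_prime = F.adaptive_max_pool1d(r_prime.transpose(1, 2), target).transpose(1, 2)
        loss = loss + dual_loss(r, r_prime)
        depth += 1
    return loss / depth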
Chapter 5
Experiments
This chapter describes the datasets and experimental setup and presents all evaluations of our proposed framework.
5.1 Experimental Setup
5.1.1 Datasets
To evaluate our proposed framework, we conducted experiments on real-world UCR and UEA datasets for time series classification. The UCR archive (Dau et al. 2018) consists of 128 univariate datasets, and the UEA archive comprises 30 multivariate datasets. The univariate datasets in the UCR archive cover categories including image, motion, spectrograph, ECG measurements, sensor, etc. (Dau et al. 2018). According to (Bagnall et al. 2018), the datasets in the UEA archive are split into Human Activity Recognition (HAR), Motion Classification, ECG Classification, EEG/MEG Classification (6 problems), Audio Spectra Classification, and others. Tables B.1 and B.2 present the details of the UCR and UEA datasets.
5.1.2 Data Preprocessing
We follow the data preprocessing of (Franceschi, Dieuleveut and Jaggi 2019) and normalize the time series datasets from the UCR archive to zero mean and unit variance. For UEA datasets, each dimension is independently normalized in the same manner. We also retain the pre-defined train and test splits from the archives. Following (Yue et al. 2022), to make a dataset with varying lengths uniform, we pad all series to the same length. In our implementation, NaNs denote missing observations, and padded values are also set to NaN. For a missing observation (NaN), the corresponding position of the mask is set to zero.
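As an illustration, the preprocessing described above could be sketched as follows; the helper name and the exact normalization granularity (dataset-level statistics are used here) are assumptions.

import numpy as np

def preprocess(series_list):
    """Zero-mean/unit-variance normalization, NaN padding and an observation mask."""
    max_len = max(len(s) for s in series_list)
    x = np.full((len(series_list), max_len), np.nan)
    for i, s in enumerate(series_list):
        x[i, :len(s)] = s
    mean, std = np.nanmean(x), np.nanstd(x)     # statistics ignore NaN entries
    x = (x - mean) / (std + 1e-8)
    mask = ~np.isnan(x)                         # zero at missing / padded positions
    return x, mask

x, mask = preprocess([[1.0, 2.0, 3.0], [4.0, 5.0]])
print(x.shape, mask.astype(int))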
5.1.3 Baselines
We compare Synth-TS2Vec against six state-of-the-art self-supervised representation learning methods for time series classification: TS2Vec (Yue et al. 2022), TS-TCC (Eldele, Ragab, Z. Chen, Wu, C. K. Kwoh et al. 2021), T-Loss (Franceschi, Dieuleveut and Jaggi 2019), TNC (Tonekaboni, Eytan and Goldenberg 2021), TST (Zerveas et al. 2021), and DTW (Y. Chen et al. 2013). The results of these methods are taken from (Yue et al. 2022), and we follow the same protocol for Synth-TS2Vec on all datasets.
5.1.4 Evaluation Metrics
To evaluate classification performance, we use the accuracy (Equation 5.1) on the test set, Acc(Xtest, ŷθ), where Xtest represents the test set and ŷθ denotes the predicted labels generated by the model with parameters θ.

\[
\mathrm{Acc}(X_{\mathrm{test}}, \hat{y}_\theta) = \frac{1}{N_{\mathrm{test}}} \sum_{i=1}^{N_{\mathrm{test}}} \mathbb{I}\left(y_i = \hat{y}_\theta(x_i)\right) \tag{5.1}
\]
For the GAN, we assess how closely the synthetic samples match the original samples using PCA.
5.2 Implementation Details
Our methodology consists of a two-phase approach and uses a uniform set of hyperparameters. Initially, we trained a generative adversarial network to generate synthetic data. Subsequently, we trained our representation model using synthetic samples from the pre-trained discriminator for time series classification tasks. We used Python 3.8.8 for our implementation with PyTorch 1.8.1. We conducted our experiments on the ISMLL cluster, featuring an NVIDIA GeForce RTX 2070S, and performed five independent runs for each experiment. During the adversarial training phase, we set the default batch size to 16 and used the Adam optimizer (Kingma and J. Ba 2017). The GAN is trained for 50 epochs on all datasets with a learning rate of 0.001. Following the recommendations of (Gulrajani et al. 2017), we did not apply batch normalization in our network architecture. For the representation model, we follow the hyperparameter settings in (Yue et al. 2022). We perform no hyperparameter optimization, as labels are not available during the training stage; instead, we applied uniform hyperparameters. The representation model is trained using only the train set of each task, and we obtain representations for the test set. We train our model with a learning rate of 0.001 and a default batch size of 8. For datasets with fewer than 100,000 instances, we conduct 200 optimisation iterations; for larger datasets, 600 iterations are performed. Further, following (Yue et al. 2022), long sequences are divided into segments of 3,000 timestamps each during the training phase. Within the linear projection layer, the hidden channel size is set to 64. The kernel size of the dilated CNN is fixed at 3. In our convolutional network architecture, each hidden dilated convolution uses a channel size of 64, and the dilated convolutions incorporate four heads of self-attention. An output residual block transforms the hidden channels to the output dimension, with the representation dimension set to 320 (Yue et al. 2022). Full details of the hyperparameter settings are given in Appendix B.
5.2.1 Time Series Classification
For all datasets, each class is assigned a label that applies to the entire time series (instance). Consequently, representations at the level of the instance are needed, and these can be derived by performing adaptive pooling across all time points (Yue et al. 2022). We evaluate the quality of our learned representations through a traditional supervised method. We use the train labels of the dataset to train a Support Vector Machine (SVM) with a radial basis function kernel on the learned representations, and we report the classification score on the test set (Franceschi, Dieuleveut and Jaggi 2019; Yue et al. 2022).
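In scikit-learn terms, this evaluation protocol can be sketched as follows; the random arrays merely stand in for the instance-level representations produced by the encoder and are not real results.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# train_repr, test_repr: (N, 320) instance-level representations from the encoder
train_repr, test_repr = np.random.randn(100, 320), np.random.randn(50, 320)
train_y, test_y = np.random.randint(0, 3, 100), np.random.randint(0, 3, 50)

clf = SVC(kernel='rbf', C=1.0)            # radial basis function kernel
clf.fit(train_repr, train_y)              # fit on the train-set representations
print(accuracy_score(test_y, clf.predict(test_repr)))   # accuracy as in Eq. (5.1)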
5.3 Results
For classification accuracy on the UCR and UEA datasets, we performed experiments by training an SVM on the learned representations. The results summarized in Tables 5.1 and 5.2 show that Synth-TS2Vec outperforms all state-of-the-art methods, including TS2Vec. The accuracy of Synth-TS2Vec is +0.31% higher than TS2Vec on the first 125 UCR datasets and +4.42% higher on the first 29 UEA datasets. We also report the rank of our proposed framework across all datasets. For a comparison of Synth-TS2Vec and TS2Vec with different batch sizes on the UCR datasets and the default batch size on UEA, we also show results in Appendix B.3 and B.6.
Dataset Synth-TS2Vec TS2Vec T-Loss TNC TS-TCC TST DTW Rank
Adiac 0.788 0.762 0.675 0.726 0.767 0.550 0.604 1 ArrowHead 0.850 0.857 0.766 0.703 0.737 0.771 0.703 2 Beef 0.760 0.767 0.667 0.733 0.600 0.500 0.633 2 BeetleFly 0.900 0.900 0.800 0.850 0.800 1.000 0.700 2 BirdChicken 0.800 0.800 0.850 0.750 0.650 0.650 0.750 2 Car 0.900 0.833 0.833 0.683 0.583 0.550 0.733 1 CBF 1.000 1.000 0.983 0.983 0.998 0.898 0.997 1 ChlorineConcentration 0.830 0.832 0.749 0.760 0.753 0.562 0.648 2 CinCECGTorso 0.827 0.827 0.713 0.669 0.671 0.508 0.651 1 Coffee 1.000 1.000 1.000 1.000 1.000 0.821 1.000 1 Computers 0.641 0.660 0.664 0.684 0.704 0.696 0.700 7 CricketX 0.790 0.782 0.713 0.623 0.731 0.385 0.754 1 CricketY 0.760 0.749 0.728 0.597 0.718 0.467 0.744 1 CricketZ 0.800 0.792 0.708 0.682 0.713 0.403 0.754 1 DiatomSizeReduction 0.984 0.984 0.984 0.993 0.977 0.961 0.967 1 DistalPhalanxOutlineCorrect 0.780 0.761 0.775 0.754 0.754 0.728 0.717 1 DistalPhalanxOutlineAgeGroup 0.734 0.727 0.727 0.741 0.755 0.741 0.770 4 DistalPhalanxTW 0.712 0.698 0.676 0.669 0.676 0.568 0.590 1 Earthquakes 0.750 0.748 0.748 0.748 0.748 0.748 0.719 1 ECG200 0.910 0.920 0.940 0.830 0.880 0.830 0.770 3 ECG5000 0.942 0.935 0.933 0.937 0.941 0.928 0.924 1 ECGFiveDays 1.000 1.000 1.000 0.999 0.878 0.763 0.768 1 ElectricDevices 0.730 0.721 0.707 0.700 0.686 0.676 0.602 1 FaceAll 0.790 0.771 0.786 0.766 0.813 0.504 0.808 3 FaceFour 0.900 0.932 0.920 0.659 0.773 0.511 0.830 3 FacesUCR 0.920 0.924 0.884 0.789 0.863 0.543 0.905 2 FiftyWords 0.780 0.771 0.732 0.653 0.653 0.525 0.690 1 Fish 0.920 0.926 0.891 0.817 0.817 0.720 0.823 2 FordA 0.940 0.936 0.928 0.902 0.930 0.568 0.555 1 FordB 0.794 0.794 0.793 0.733 0.815 0.507 0.620 2 GunPoint 0.980 0.980 0.980 0.967 0.993 0.827 0.907 2 Ham 0.760 0.714 0.724 0.752 0.743 0.524 0.467 1 HandOutlines 0.922 0.922 0.922 0.930 0.724 0.735 0.881 2 Haptics 0.526 0.526 0.490 0.474 0.396 0.357 0.377 1 Herring 0.580 0.641 0.594 0.594 0.594 0.594 0.531 3 InlineSkate 0.415 0.415 0.371 0.378 0.347 0.287 0.384 1 InsectWingbeatSound 0.653 0.630 0.597 0.549 0.415 0.266 0.355 1 ItalyPowerDemand 0.950 0.925 0.954 0.928 0.955 0.845 0.950 3 LargeKitchenAppliances 0.820 0.845 0.789 0.776 0.848 0.595 0.795 3 Lightning2 0.930 0.869 0.869 0.869 0.836 0.705 0.869 1 Lightning7 0.823 0.863 0.795 0.767 0.685 0.411 0.726 2 Mallat 0.914 0.914 0.951 0.871 0.922 0.713 0.934 4 Meat 0.940 0.950 0.950 0.917 0.883 0.900 0.933 2 MedicalImages 0.790 0.789 0.750 0.754 0.747 0.632 0.737 1 MiddlePhalanxOutlineCorrect 0.830 0.838 0.825 0.818 0.818 0.753 0.698 2 MiddlePhalanxOutlineAgeGroup 0.630 0.636 0.656 0.643 0.630 0.617 0.500 4 MiddlePhalanxTW 0.590 0.584 0.591 0.571 0.610 0.506 0.506 3 MoteStrain 0.870 0.861 0.851 0.825 0.843 0.768 0.835 1 NonInvasiveFetalECGThorax1 0.930 0.930 0.878 0.898 0.898 0.471 0.790 1 NonInvasiveFetalECGThorax2 0.938 0.938 0.919 0.912 0.913 0.832 0.865 1 OliveOil 0.900 0.900 0.867 0.833 0.800 0.800 0.833 1 OSULeaf 0.860 0.851 0.760 0.723 0.723 0.545 0.591 1 PhalangesOutlinesCorrect 0.820 0.809 0.784 0.787 0.804 0.773 0.728 1 Phoneme 0.312 0.312 0.276 0.180 0.242 0.139 0.228 1 Plane 1.000 1.000 0.990 1.000 1.000 0.933 1.000 1 ProximalPhalanxOutlineCorrect 0.910 0.887 0.859 0.866 0.873 0.770 0.784 1 ProximalPhalanxOutlineAgeGroup 0.850 0.834 0.844 0.854 0.839 0.854 0.805 2 ProximalPhalanxTW 0.815 0.824 0.771 0.810 0.800 0.780 0.761 2 RefrigerationDevices 0.580 0.589 0.515 0.565 0.563 0.483 0.464 2 ScreenType 0.430 0.411 0.416 0.509 0.419 0.419 0.397 2
ShapeletSim 1.000 1.000 0.672 0.589 0.683 0.489 0.650 1 ShapesAll 0.900 0.902 0.848 0.788 0.773 0.733 0.768 2 SmallKitchenAppliances 0.740 0.731 0.677 0.725 0.691 0.592 0.643 1 SonyAIBORobotSurface1 0.900 0.903 0.902 0.804 0.899 0.724 0.725 3 SonyAIBORobotSurface2 0.910 0.871 0.889 0.834 0.907 0.745 0.831 1 StarLightCurves 0.969 0.969 0.964 0.968 0.967 0.949 0.907 1 Strawberry 0.970 0.962 0.954 0.951 0.965 0.916 0.941 1 SwedishLeaf 0.940 0.941 0.914 0.880 0.923 0.738 0.792 2 Symbols 0.960 0.976 0.963 0.885 0.916 0.786 0.950 3 SyntheticControl 1.000 0.997 0.987 1.000 0.990 0.490 0.993 1 ToeSegmentation1 0.930 0.917 0.939 0.864 0.930 0.807 0.772 2 ToeSegmentation2 0.931 0.892 0.900 0.831 0.877 0.615 0.838 1 Trace 1.000 1.000 0.990 1.000 1.000 1.000 1.000 1 TwoLeadECG 0.970 0.986 0.999 0.993 0.976 0.871 0.905 4 TwoPatterns 1.000 1.000 0.999 1.000 0.999 0.466 1.000 1 UWaveGestureLibraryX 0.823 0.795 0.785 0.781 0.733 0.569 0.728 1 UWaveGestureLibraryY 0.750 0.719 0.710 0.697 0.641 0.348 0.634 1 UWaveGestureLibraryZ 0.770 0.770 0.757 0.721 0.690 0.655 0.658 1 UWaveGestureLibraryAll 0.930 0.930 0.896 0.903 0.692 0.475 0.892 1 Wafer 1.000 0.998 0.992 0.994 0.994 0.991 0.980 1 Wine 0.910 0.870 0.815 0.759 0.778 0.500 0.574 1 WordSynonyms 0.692 0.676 0.691 0.630 0.531 0.422 0.649 1 Worms 0.701 0.701 0.727 0.623 0.753 0.455 0.584 3 WormsTwoClass 0.805 0.805 0.792 0.727 0.753 0.584 0.623 1 Yoga 0.880 0.887 0.837 0.812 0.791 0.830 0.837 2 ACSF1 0.900 0.900 0.900 0.730 0.730 0.760 0.640 1 AllGestureWiimoteX 0.760 0.777 0.763 0.703 0.697 0.259 0.716 3 AllGestureWiimoteY 0.778 0.793 0.726 0.699 0.741 0.423 0.729 2 AllGestureWiimoteZ 0.700 0.746 0.723 0.646 0.689 0.447 0.643 3 BME 0.990 0.993 0.993 0.973 0.933 0.760 0.900 2 Chinatown 0.978 0.965 0.951 0.977 0.983 0.936 0.957 2 Crop 0.760 0.756 0.722 0.738 0.742 0.710 0.665 1 EOGHorizontalSignal 0.539 0.539 0.605 0.442 0.401 0.373 0.503 2 EOGVerticalSignal 0.503 0.503 0.434 0.392 0.376 0.298 0.448 1 EthanolLevel 0.468 0.468 0.382 0.424 0.486 0.260 0.276 2 FreezerRegularTrain 0.991 0.986 0.956 0.991 0.989 0.922 0.899 1 FreezerSmallTrain 0.880 0.870 0.933 0.982 0.979 0.920 0.753 5 Fungi 0.920 0.957 1.000 0.527 0.753 0.366 0.839 3 GestureMidAirD1 0.600 0.608 0.608 0.431 0.369 0.208 0.569 2 GestureMidAirD2 0.610 0.469 0.546 0.362 0.254 0.138 0.608 1 GestureMidAirD3 0.280 0.292 0.285 0.292 0.177 0.154 0.323 4 GesturePebbleZ1 0.900 0.930 0.919 0.378 0.395 0.500 0.791 3 GesturePebbleZ2 0.800 0.873 0.899 0.316 0.430 0.380 0.671 3 GunPointAgeSpan 1.000 0.987 0.994 0.984 0.994 0.991 0.918 1 GunPointMaleVersusFemale 1.000 1.000 0.997 0.994 0.997 1.000 0.997 1 GunPointOldVersusYoung 1.000 1.000 1.000 1.000 1.000 1.000 0.838 1 HouseTwenty 0.916 0.916 0.933 0.782 0.790 0.815 0.924 3 InsectEPGRegularTrain 1.000 1.000 1.000 1.000 1.000 1.000 0.872 1 InsectEPGSmallTrain 1.000 1.000 1.000 1.000 1.000 1.000 0.735 1 MelbournePedestrian 0.941 0.959 0.944 0.942 0.949 0.741 0.791 5 MixedShapesRegularTrain 0.959 0.917 0.905 0.911 0.855 0.879 0.842 1 MixedShapesSmallTrain 0.917 0.861 0.860 0.813 0.735 0.828 0.780 1 PickupGestureWiimoteZ 0.740 0.820 0.740 0.620 0.600 0.240 0.660 2 PigAirwayPressure 0.630 0.630 0.510 0.413 0.380 0.120 0.106 1 PigArtPressure 0.966 0.966 0.928 0.808 0.524 0.774 0.245 1 PigCVP 0.812 0.812 0.788 0.649 0.615 0.596 0.154 1 PLAID 0.561 0.561 0.555 0.495 0.445 0.419 0.840 2 PowerCons 0.970 0.961 0.900 0.933 0.961 0.911 0.878 1 Rock 0.700 0.700 0.580 0.580 0.600 0.680 0.600 1 SemgHandGenderCh2 0.963 0.963 0.890 0.882 0.837 0.725 0.802 1
SemgHandMovementCh2 0.860 0.860 0.789 0.593 0.613 0.420 0.584 1 SemgHandSubjectCh2 0.951 0.951 0.853 0.771 0.753 0.484 0.727 1 ShakeGestureWiimoteZ 0.920 0.940 0.920 0.820 0.860 0.760 0.860 2 SmoothSubspace 0.981 0.980 0.960 0.913 0.953 0.827 0.827 1 UMD 1.000 1.000 0.993 0.993 0.986 0.910 0.993 1 DodgerLoopDay 0.562 0.562 – – – 0.200 0.500 1 DodgerLoopGame 0.841 0.841 – – – 0.696 0.877 2 DodgerLoopWeekend 0.964 0.964 – – – 0.732 0.949 1 On the first 125 datasets: AVG 0.832 0.830 0.806 0.761 0.757 0.641 0.727 Rank 1.728 1.82 2.73 3.52 3.38 5.23 4.33
Table 5.1: Full Results on first 125 UCR datasets
Dataset Synth-TS2Vec TS2Vec T-Loss TNC TS-TCC TST DTW Rank
ArticularyWordRecognition 0.986 0.987 0.943 0.973 0.953 0.977 0.987 2
AtrialFibrillation 0.200 0.200 0.133 0.133 0.267 0.067 0.200 2
BasicMotions 1.000 0.975 1.000 0.975 1.000 0.975 0.975 1
CharacterTrajectories 0.994 0.995 0.993 0.967 0.985 0.975 0.989 2
Cricket 0.972 0.972 0.972 0.958 0.917 1.000 1.000 2
DuckDuckGeese 0.600 0.680 0.650 0.460 0.380 0.620 0.600 4
EigenWorms 0.850 0.847 0.840 0.840 0.779 0.748 0.618 1
Epilepsy 0.964 0.964 0.971 0.957 0.957 0.949 0.964 2
ERing 0.874 0.874 0.133 0.852 0.904 0.874 0.133 2
EthanolConcentration 0.538 0.308 0.205 0.297 0.285 0.262 0.323 1
FaceDetection 0.530 0.501 0.513 0.536 0.544 0.534 0.529 4
FingerMovements 0.550 0.480 0.580 0.470 0.460 0.560 0.530 3
HandMovementDirection 0.324 0.338 0.351 0.324 0.243 0.243 0.231 3
Handwriting 0.742 0.515 0.451 0.249 0.498 0.225 0.286 1
Heartbeat 0.737 0.683 0.741 0.746 0.751 0.746 0.717 4
JapaneseVowels 0.981 0.984 0.989 0.978 0.930 0.978 0.949 3
Libras 0.866 0.867 0.883 0.817 0.822 0.656 0.870 3
LSST 0.562 0.537 0.509 0.595 0.474 0.408 0.551 2
MotorImagery 0.510 0.510 0.580 0.500 0.610 0.500 0.500 3
NATOPS 0.936 0.928 0.917 0.911 0.822 0.850 0.883 1
PEMS-SF 0.666 0.682 0.676 0.699 0.734 0.740 0.711 6
PenDigits 0.987 0.989 0.981 0.979 0.974 0.560 0.977 2
PhonemeSpectra 0.256 0.233 0.222 0.207 0.252 0.085 0.151 1
RacketSports 0.868 0.855 0.855 0.776 0.816 0.809 0.803 1
SelfRegulationSCP1 0.846 0.812 0.843 0.799 0.823 0.754 0.775 1
SelfRegulationSCP2 0.578 0.578 0.539 0.550 0.533 0.550 0.539 1
SpokenArabicDigits 0.988 0.988 0.905 0.934 0.970 0.923 0.963 1
StandWalkJump 0.766 0.467 0.333 0.400 0.333 0.267 0.200 1
UWaveGestureLibrary 0.900 0.906 0.875 0.759 0.753 0.575 0.903 3
InsectWingbeat 0.460 0.466 0.156 0.469 0.264 0.105 - 3
On the first 29 datasets:
AVG 0.7438 0.712 0.675 0.677 0.682 0.635 0.650
Rank 2.17 2.40 3.121 3.845 3.534 4.362 3.741
Table 5.2: Full Results on first 29 UEA datasets
5.3.1 Visualization
To evaluate the fidelity of the generated samples, we visualize the original and synthetic samples using PCA in a 2-dimensional space. Figures 5.1 and 5.2 show the PCA visualizations on various datasets from the UCR and UEA archives. This method allows us to assess how accurately the generated samples replicate the distribution of the original samples. Additionally, Figure 5.3 shows t-SNE visualizations of the learned representations.
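A minimal sketch of this comparison, assuming real and synthetic sample matrices of shape (N, T), is shown below; it is only an illustration of the visualization procedure, not the plotting code used for the figures.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_pca(real, synthetic):
    """Project real and GAN-generated series onto two principal components."""
    pca = PCA(n_components=2).fit(real)
    r2, s2 = pca.transform(real), pca.transform(synthetic)
    plt.scatter(r2[:, 0], r2[:, 1], c='red', alpha=0.5, label='original')
    plt.scatter(s2[:, 0], s2[:, 1], c='blue', alpha=0.5, label='synthetic')
    plt.legend()
    plt.show()

# usage with dummy data of 60 series of length 96
plot_pca(np.random.randn(60, 96), np.random.randn(60, 96))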
Figure 5.1: PCA plots of 5 UCR datasets (Crop, FordA, CricketX, SmoothSubspace, ElectricDevices) after 50 epochs. Red denotes original and blue denotes synthetic data.
Figure 5.2: PCA plots of 5 UEA datasets (NATOPS, RacketSports, PhonemeSpectra, FingerMovements, PenDigits) after 50 epochs. Red denotes original and blue denotes synthetic data.
Figure 5.3: t-SNE plots of learned embeddings on the 6 UCR datasets with the most test samples (Crop, ElectricDevices, StarLightCurves, ChlorineConcentration, TwoPatterns, Wafer). Each class is represented by a different color.
Figure 5.4: PCA plots of synthetic samples on the ElectricDevices dataset from the UCR archive at epochs 3, 50, and 100.
In our analysis, we further investigate the impact of varying the number of training epochs on the ElectricDevices dataset, which comprises a test set of 7,711 samples. During the training process, we notice that the synthetic samples (blue dots) move closer to the original samples (red dots). This demonstrates that with an increasing number of epochs, the model learns the underlying distribution of the original data, as shown in Figure 5.4.
5.3.2 Training Time
In our evaluation, we also report the training time of our model and TS2Vec. We observed a significant increase in the training time of our model when augmenting our dataset with GAN-generated samples as compared to TS2Vec. This increased training time can be attributed to the increase in training data.
Datasets Original Proposed
125 UCR datasets 0.9 hours 3.0 hours
30 UEA datasets 0.6 hours 1.5 hours
Table 5.3: Execution time of TS2Vec vs Synth-TS2Vec
5.4 Ablation Study
We verify the impact of all the components in Synth-TS2Vec and its variants. Tables 5.4 and 5.5 show ablation results on the first 125 UCR and 29 UEA datasets, where (1) w/o GAN Synthetic Samples removes the synthetic samples from the model, (2) 4 heads → 2 heads replaces the 4 self-attention heads with 2 heads, (3) adaptive pooling → max pooling uses max pooling for obtaining representations of a sub-series, (4, 5, 6) timestamp masking applies the binomial, Gaussian, or continuous timestamp masking variant to the model, as illustrated in Figure B.1, (7) SVM → KNN trains a KNN classifier instead of the SVM, (8) w/o Attention removes attention, (9) Dilated CNN kernel size: 2 → 3 replaces the kernel size in the dilated CNN with 3, and (10) Dilated CNN convolution layer: 2 → 1 employs only one convolution layer within a residual block.
Based on the ablation results, we conclude that certain components significantly contribute to the model's performance. In particular, we observe a performance drop in Synth-TS2Vec when we remove attention or synthetic samples. Specifically, adding attention to TS2Vec improves the model's accuracy, implying that attention mechanisms effectively capture dependencies within time series data. While other components such as the number of self-attention heads, the pooling strategy, and the CNN configuration also influence the accuracy, their impact is comparatively minor compared to attention and GAN-generated samples.
Ablation type Avg. Accuracy
Default (Synth-TS2Vec) 0.8322 (+0.310%)
w/o GAN Synthetic Samples 0.8310 (−0.15%)
4 heads → 2 heads 0.8254 (−0.81%)
adaptive pooling → max pooling 0.8305 (−0.20%)
w/ binomial timestamp masking 0.8312 (−0.12%)
w/ gaussian timestamp masking 0.8226 (−1.15%)
w/ continuous timestamp masking 0.8175 (−1.15%)
SVM → KNN 0.8301 (−0.25%)
w/o Attention 0.8231 (−0.82%)
Dilated CNN kernel size: 2 → 3 0.8229 (−1.10%)
Dilated CNN convolution layer: 2 → 1 0.8102 (−2.65%)
Table 5.4: Ablation Results on first 125 UCR datasets
Ablation type Avg. Accuracy
Default (Synth-TS2Vec) 0.7438 (+4.42%)
w/o GAN Synthetic Samples 0.7150 (−3.99%)
4 heads → 2 heads 0.7128 (−4.33%)
adaptive pooling → max pooling 0.6985 (−6.47%)
w/ binomial timestamp masking 0.7111 (−4.58%)
w/ gaussian timestamp masking 0.7018 (−5.96%)
w/ continuous timestamp masking 0.7052 (−5.38%)
SVM → KNN 0.7087 (−4.94%)
w/o Attention 0.7123 (−4.39%)
Dilated CNN kernel size: 2 → 3 0.7189 (−3.46%)
Dilated CNN convolution layer: 2 → 1 0.7006 (−6.14%)
Table 5.5: Ablation Results on first 29 UEA datasets
Chapter 6
Conclusion
In this thesis, we focus on learning time series representations for classification tasks. For our research, we expand upon the work of “TS2Vec: Towards Universal Representation of Time Series” (Yue et al. 2022). We carried out extensive experimentation on our proposed model through ablation studies and compared it with state-of-the-art representation methods for time series data. Through our research, we conclude that a combination of generative adversarial networks (GANs) and random cropping for data augmentation, along with self-attention integrated into dilated convolutions, improved our framework's accuracy. Overall, our model has demonstrated a strong ability to capture complex temporal patterns at different semantic levels on both univariate and multivariate datasets.
For future research, we suggest investigating further data augmentation techniques, such as Variational Autoencoders (VAEs), which could potentially generate more diverse synthetic samples. We also noticed that increasing the number of epochs during GAN training allows the model to capture the distribution of the original data more precisely. These avenues can be explored to further unlock the capabilities of representation learning for time series classification tasks.
Bibliography
Ariyo, Adebiyi A., Adewumi O. Adewumi and Charles K. Ayo (2014). “Stock Price Prediction Using the ARIMA Model”. In: 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, pp. 106– 112.
Arjovsky, Martin, Soumith Chintala and Léon Bottou (2017). “Wasserstein generative adversarial networks”. In: International conference on machine learning. PMLR, pp. 214–223.
Ba, Jimmy Lei, Jamie Ryan Kiros and Geoffrey E. Hinton (2016). Layer Normalization.
Baevski, Alexei et al. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.
Bagnall, Anthony et al. (2018). “The UEA multivariate time series classific- ation archive, 2018”. In: arXiv preprint arXiv:1811.00075.
Bai, Shaojie, J Zico Kolter and Vladlen Koltun (2018). “An empirical evalu- ation of generic convolutional and recurrent networks for sequence mod- eling”. In: arXiv preprint arXiv:1803.01271.
Box, George EP et al. (2015). Time series analysis: forecasting and control. John Wiley & Sons.
Brophy, Eoin et al. (2023). “Generative adversarial networks in time series: A systematic literature review”. In: ACM Computing Surveys 55.10, pp. 1– 31.
Buhlurl, Nikolaj (n.d.). Full guide to contrastive learning.
Cai, Changchun et al. (2021). “Short-Term Load Forecasting Based on Deep Learning Bidirectional LSTM Neural Network”. In: Applied Sciences 11.17.
Chen, Kaixuan et al. (2021). “Deep learning for sensor-based human activity recognition: Overview, challenges, and opportunities”. In: ACM Comput- ing Surveys (CSUR) 54.4, pp. 1–40.
Chen, Ting et al. (2020). “A simple framework for contrastive learning of visual representations”. In: International conference on machine learning. PMLR, pp. 1597–1607.
Chen, Yanping et al. (2013). “DTW-D: time series semi-supervised learning from a single example”. In: Proceedings of the 19th ACM SIGKDD Inter- national Conference on Knowledge Discovery and Data Mining. KDD ’13. Chicago, Illinois, USA: Association for Computing Machinery, pp. 383– 391.
Choi, Kukjin et al. (2021). “Deep learning for anomaly detection in time- series data: review, analysis, and guidelines”. In: IEEE Access 9, pp. 120043– 120065.
Chowdhury, Ranak Roy et al. (2022). “TARNet: Task-Aware Reconstruction for Time-Series Transformer”. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. KDD ’22. Wash- ington DC, USA: Association for Computing Machinery, pp. 212–220.
Cook, AA, G Misirli and Z Fan (2019). Anomaly detection for IoT time-series data: a survey. IEEE Internet Things J. 7 (7), 6481–6494 (2020).
Dancker, Jonte (Dec. 2022). A brief introduction to recurrent neural networks.
Dau, Hoang Anh et al. (Oct. 2018). The UCR Time Series Classification Archive. https://www.cs.ucr.edu/~eamonn/time_series_data_ 2018/.
Dempster, Angus, François Petitjean and Geoffrey I Webb (2020). “ROCKET: exceptionally fast and accurate time series classification using random convolutional kernels”. In: Data Mining and Knowledge Discovery 34.5, pp. 1454–1495.
Devlin, Jacob et al. (2019). BERT: Pre-training of Deep Bidirectional Trans- formers for Language Understanding.
Donahue, Jeff and Karen Simonyan (2019). “Large scale adversarial repres- entation learning”. In: Advances in neural information processing systems 32.
Dong, Jiaxiang et al. (2023). SimMTM: A Simple Pre-Training Framework for Masked Time-Series Modeling.
Dosovitskiy, Alexey et al. (2021). An Image is Worth 16x16 Words: Trans- formers for Image Recognition at Scale.
Duy, Vo Nguyen Le and Ichiro Takeuchi (2023). Statistical Inference for the Dynamic Time Warping Distance, with Application to Abnormal Time- Series Detection.
“Dynamic Time Warping” (2007). In: Information Retrieval for Music and Motion. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 69–84.
Eldele, Emadeldeen, Mohamed Ragab, Zhenghua Chen, Min Wu, Chee Keong Kwoh et al. (2021). “Time-series representation learning via temporal and contextual contrasting”. In: arXiv preprint arXiv:2106.14112.
Eldele, Emadeldeen, Mohamed Ragab, Zhenghua Chen, Min Wu, Chee-Keong Kwoh et al. (2023). “Label-efficient time series representation learning: A review”. In: arXiv preprint arXiv:2302.06433.
“Flexible Dynamic Time Warping for Time Series Classification” (2015). In: Procedia Computer Science 51. International Conference On Computa- tional Science, ICCS 2015, pp. 2838–2842.
Franceschi, Jean-Yves, Aymeric Dieuleveut and Martin Jaggi (2019). “Un- supervised scalable representation learning for multivariate time series”. In: Advances in neural information processing systems 32.
Galvez, Reagan L. et al. (2018). “Object Detection Using Convolutional Neural Networks”. In: TENCON 2018 - 2018 IEEE Region 10 Confer- ence, pp. 2023–2027.
Glorot, Xavier and Yoshua Bengio (13–15 May 2010). “Understanding the difficulty of training deep feedforward neural networks”. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Ed. by Yee Whye Teh and Mike Titterington. Vol. 9. Proceed- ings of Machine Learning Research. Chia Laguna Resort, Sardinia, Italy: PMLR, pp. 249–256.
Goodfellow, Ian et al. (2014). “Generative adversarial nets”. In: Advances in neural information processing systems 27.
Gu, Fuqiang et al. (2021). “A survey on deep learning for human activity recognition”. In: ACM Computing Surveys (CSUR) 54.8, pp. 1–34.
Gulrajani, Ishaan et al. (2017). “Improved training of wasserstein gans”. In: Advances in neural information processing systems 30.
He, Kaiming et al. (2016). “Deep residual learning for image recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
Hendrycks, Dan and Kevin Gimpel (2023). Gaussian Error Linear Units (GELUs).
Huang, Fanling and Yangdong Deng (2023). “TCGAN: Convolutional Gen- erative Adversarial Network for time series classification and clustering”. In: Neural Networks 165, pp. 868–883.
Hughes, Mark et al. (2017). “Medical Text Classification using Convolutional Neural Networks”. In: Studies in health technology and informatics 235, pp. 246–250.
Iglesias, Guillermo et al. (2022). “Data Augmentation techniques in time series domain: a survey and taxonomy”. In: Neural Computing and Ap- plications 35, pp. 10123–10145.
Ingolfsson, Thorir Mar (Nov. 2021). Insights into LSTM architecture.
Ismail Fawaz, Hassan, Germain Forestier et al. (2019). “Deep learning for time series classification: a review”. In: Data mining and knowledge dis- covery 33.4, pp. 917–963.
Ismail Fawaz, Hassan, Benjamin Lucas et al. (2020). “Inceptiontime: Finding alexnet for time series classification”. In: Data Mining and Knowledge Discovery 34.6, pp. 1936–1962.
Iwana, Brian Kenji and Seiichi Uchida (July 2021). “An empirical survey of data augmentation for time series classification with neural networks”. In: PLOS ONE 16.7. Ed. by Friedhelm Schwenker, e0254841.
Kingma, Diederik P. and Jimmy Ba (2017). Adam: A Method for Stochastic Optimization.
Laurent, César et al. (2015). Batch Normalized Recurrent Neural Networks.
Li, Shiyang et al. (2019). “Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting”. In: Advances in neural information processing systems 32.
Li, Zhe et al. (2023). Ti-MAE: Self-Supervised Masked Time Series Autoen- coders.
Liao, Shujian et al. (2023). Conditional Sig-Wasserstein GANs for Time Series Generation.
Lim, Bryan and Stefan Zohren (2021). “Time-series forecasting with deep learning: a survey”. In: Philosophical Transactions of the Royal Society A 379.2194, p. 20200209.
Malhotra, Pankaj et al. (Apr. 2017). “TimeNet: Pre-trained deep recurrent neural network for time series classification”. In.
Meng, Qianwen et al. (2023). “Unsupervised representation learning for time series: A review”. In: arXiv preprint arXiv:2308.01578.
Nie, Yuqi et al. (2023). A Time Series is Worth 64 Words: Long-term Fore- casting with Transformers.
Nikitin, Alexander (Apr. 2024). Time series augmentations.
Olah, Christopher (2015). Understanding LSTM networks.
Oord, Aaron van den et al. (2016). “Wavenet: A generative model for raw audio”. In: arXiv preprint arXiv:1609.03499.
Ouyang, Xi et al. (2015). “Sentiment Analysis Using Convolutional Neural Network”. In: 2015 IEEE International Conference on Computer and In- formation Technology; Ubiquitous Computing and Communications; De- pendable, Autonomic and Secure Computing; Pervasive Intelligence and Computing, pp. 2359–2364.
Poudel, Sushmita (Aug. 2023). Recurrent neural network (RNN) architecture explained.
Proceduralia (2019). Proceduralia/Pytorch-Gan-timeseries. https://github.com/proceduralia/pytorch-GAN-timeseries/.
Radford, Alec, Rafal Jozefowicz and Ilya Sutskever (2018). Learning To Gen- erate Reviews and Discovering Sentiment.
Radford, Alec, Luke Metz and Soumith Chintala (2015). “Unsupervised rep- resentation learning with deep convolutional generative adversarial net- works”. In: arXiv preprint arXiv:1511.06434.
Ruiz, Alejandro Pasos et al. (2021). “The great multivariate time series clas- sification bake off: a review and experimental evaluation of recent al- gorithmic advances”. In: Data Mining and Knowledge Discovery 35.2, pp. 401–449.
Salinas, David, Valentin Flunkert and Jan Gasthaus (2019). DeepAR: Prob- abilistic Forecasting with Autoregressive Recurrent Networks.
Salinas, David, Valentin Flunkert, Jan Gasthaus and Tim Januschowski (2020). “DeepAR: Probabilistic forecasting with autoregressive recurrent net- works”. In: International Journal of Forecasting 36.3, pp. 1181–1191.
Sezer, Omer Berat, Mehmet Ugur Gudelek and Ahmet Murat Ozbayoglu (2020). “Financial time series forecasting with deep learning: A systematic literature review: 2005–2019”. In: Applied soft computing 90, p. 106181.
Sharma, Neha, Vibhor Jain and Anju Mishra (2018). “An Analysis Of Con- volutional Neural Networks For Image Classification”. In: Procedia Com- puter Science 132. International Conference on Computational Intelli- gence and Data Science, pp. 377–384.
Shenfield, Alex and Martin Howarth (Sept. 2020). “A Novel Deep Learning Model for the Detection and Identification of Rolling Element-Bearing Faults”. In: Sensors (Basel, Switzerland) 20.
Shi, Xingjian et al. (2015). “Convolutional LSTM network: A machine learn- ing approach for precipitation nowcasting”. In: Advances in neural in- formation processing systems 28.
Shkulov, Valentine (May 2023). Advanced techniques for time series data feature engineering.
Sun, Chenxi et al. (2020). “A review of deep learning methods for irregularly sampled medical time series data”. In: arXiv preprint arXiv:2010.12493.
Szegedy, Christian et al. (2017). “Inception-v4, inception-resnet and the im- pact of residual connections on learning”. In: Proceedings of the AAAI conference on artificial intelligence. Vol. 31. 1.
Tonekaboni, Sana, Danny Eytan and Anna Goldenberg (2021). “Unsuper- vised representation learning for time series with temporal neighborhood coding”. In: arXiv preprint arXiv:2106.00750.
Trirat, Patara et al. (2024). “Universal Time-Series Representation Learning: A Survey”. In: arXiv preprint arXiv:2401.03717.
Tuli, Shreshth, Giuliano Casale and Nicholas R. Jennings (2022). TranAD: Deep Transformer Networks for Anomaly Detection in Multivariate Time Series Data.
Vaswani, Ashish et al. (2017). “Attention is all you need”. In: Advances in neural information processing systems 30.
Wang, Lei and Piotr Koniusz (2022). Uncertainty-DTW for Time Series and Sequences.
Wang, Yinhai, Zhiyong Cui and Ruimin Ke (2023). “Chapter 3 - Machine learning basics”. In: Machine Learning for Transportation Research and Applications. Ed. by Yinhai Wang, Zhiyong Cui and Ruimin Ke. Elsevier, pp. 25–40.
Wang, Zhengwei, Qi She and Tomas E Ward (2021). “Generative adversarial networks in computer vision: A survey and taxonomy”. In: ACM Com- puting Surveys (CSUR) 54.2, pp. 1–38.
Wang, Zhiguang, Weizhong Yan and Tim Oates (2017). “Time series classi- fication from scratch with deep neural networks: A strong baseline”. In: 2017 International joint conference on neural networks (IJCNN). IEEE, pp. 1578–1585.
Wen, Qingsong et al. (2020). “Time series data augmentation for deep learn- ing: A survey”. In: arXiv preprint arXiv:2002.12478.
Witter, R Teal (2023).
Yang, Xinyu, Zhenguo Zhang and Rongyi Cui (2022). “Timeclr: A self- supervised contrastive learning framework for univariate time series rep- resentation”. In: Knowledge-Based Systems 245, p. 108606.
Yang, Zhenyu, Yantao Li and Gang Zhou (Apr. 2023). “TS-GAN: Time-series GAN for Sensor-based Health Data Augmentation”. In: ACM Trans. Comput. Healthcare 4.2.
Yoon, Jinsung, Daniel Jarrett and Mihaela Van der Schaar (2019). “Time- series generative adversarial networks”. In: Advances in neural informa- tion processing systems 32.
Young, Tom et al. (2018). “Recent trends in deep learning based natural language processing”. In: ieee Computational intelligenCe magazine 13.3, pp. 55–75.
Yue, Zhihan et al. (2022). “Ts2vec: Towards universal representation of time series”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. 8, pp. 8980–8987.
Zeng, Zhen et al. (2023). “Financial time series forecasting using CNN and transformer”. In: arXiv preprint arXiv:2304.04912.
Zerveas, George et al. (2021). “A transformer-based framework for multivari- ate time series representation learning”. In: Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, pp. 2114– 2124.
Zhang, G.Peter (2003). “Time series forecasting using a hybrid ARIMA and neural network model”. In: Neurocomputing 50, pp. 159–175.
Appendix A
Architecture Details
The detailed descriptions of the architectures for the LSTM-based generator and discriminator (critic) employed within the LSTM-WGAN-GP framework are provided in Table A.1 and Table A.2.
A.1 GAN Architecture
Layer Index Layer Type Configuration Output Shape
0 Input Noise LSTM(Seq Len, hid dim) (Batch Size, Seq Len, hid dim)
1 Activation LeakyReLU (Batch Size, Seq Len, hid dim)
2 Output Linear(hid dim, n features) (Batch Size, Seq Len, n features)
Table A.1: Generator Architecture
Layer Index Layer Type Configuration Output Shape
0 Input Data LSTM(Seq Len, hid dim) (Batch Size, Seq Len, hid dim)
1 Activation LeakyReLU (Batch Size, Seq Len, hid dim)
2 Linear Output Linear(hid dim, 1) (Batch Size, Seq Len, 1)
3 Aggregation Mean over Seq Len (Batch Size, 1)
Table A.2: Discriminator (Critic) Architecture
The Seq Len and n features parameters are dataset-specific and are derived from the shape of the input data as n instances x n timestamps x n features. In our implementation, we empirically chose a hidden dimension (hid dim) size of 256 for the LSTM-based generator and discriminator architectures.
A.2 Random Cropping
Random cropping is utilized to create new contexts for any input time series xi. For any time series input xi ∈ RT×F , TS2Vec randomly samples two overlapping time segments [a1, b1] and [a2, b2] such that 0 < a1 ≤ a2 ≤ b1 ≤ b2 ≤ T . The contextual representations within the overlapping segment [a2, b1] are required to be consistent across both contexts (Yue et al. 2022).
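A sketch of sampling two such overlapping crops (using 0-indexed half-open intervals) might look as follows; the function name and the sampling order are illustrative assumptions rather than the exact TS2Vec code.

import numpy as np

def sample_overlapping_crops(T):
    """Sample two overlapping segments [a1, b1) and [a2, b2) with a1 <= a2 <= b1 <= b2 <= T."""
    overlap = np.random.randint(1, T + 1)        # length of the shared segment [a2, b1)
    a2 = np.random.randint(0, T - overlap + 1)
    b1 = a2 + overlap
    a1 = np.random.randint(0, a2 + 1)            # first crop starts at or before the overlap
    b2 = np.random.randint(b1, T + 1)            # second crop ends at or after the overlap
    return (a1, b1), (a2, b2)

(a1, b1), (a2, b2) = sample_overlapping_crops(128)
# x[:, a1:b1] and x[:, a2:b2] are the two context views; [a2, b1) is the overlap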
Figure A.1: Random Cropping to create new contexts.
Appendix B
Experimental Details
Dataset Train Samples Test Samples Classes Length
Adiac 390 391 37 176 ArrowHead 36 175 3 251 Beef 30 30 5 470 BeetleFly 20 20 2 512 BirdChicken 20 20 2 512 Car 60 60 4 577 CBF 30 900 3 128 ChlorineConcentration 467 3840 3 166 CinCECGTorso 40 1380 4 1639 Coffee 28 28 2 286 Computers 250 250 2 720 CricketX 390 390 12 300 CricketY 390 390 12 300 CricketZ 390 390 12 300 DiatomSizeReduction 16 306 4 345 DistalPhalanxOutlineAgeGroup 400 139 3 80 DistalPhalanxOutlineCorrect 600 276 2 80 DistalPhalanxTW 400 139 6 80 Earthquakes 322 139 2 512 ECG200 100 100 2 96 ECG5000 500 4500 5 140 ECGFiveDays 23 861 2 136 ElectricDevices 8926 7711 7 96 FaceAll 560 1690 14 131 FaceFour 24 88 4 350
FacesUCR 200 2050 14 131 FiftyWords 450 455 50 270 Fish 175 175 7 463 FordA 3601 1320 2 500 FordB 3636 810 2 500 GunPoint 50 150 2 150 Ham 109 105 2 431 HandOutlines 1000 370 2 2709 Haptics 155 308 5 1092 Herring 64 64 2 512 InlineSkate 100 550 7 1882 InsectWingbeatSound 220 1980 11 256 ItalyPowerDemand 67 1029 2 24 LargeKitchenAppliances 375 375 3 720 Lightning2 60 61 2 637 Lightning7 70 73 7 319 Mallat 55 2345 8 1024 Meat 60 60 3 448 MedicalImages 381 760 10 99 MiddlePhalanxOutlineAgeGroup 400 154 3 80 MiddlePhalanxOutlineCorrect 600 291 2 80 MiddlePhalanxTW 399 154 6 80 MoteStrain 20 1252 2 84 NonInvasiveFetalECGThorax1 1800 1965 42 750 NonInvasiveFetalECGThorax2 1800 1965 42 750 OliveOil 30 30 4 570 OSULeaf 200 242 6 427 PhalangesOutlinesCorrect 1800 858 2 80 Phoneme 214 1896 39 1024 Plane 105 105 7 144 ProximalPhalanxOutlineAgeGroup 400 205 3 80 ProximalPhalanxOutlineCorrect 600 291 2 80 ProximalPhalanxTW 400 205 6 80 RefrigerationDevices 375 375 3 720 ScreenType 375 375 3 720 ShapeletSim 20 180 2 500 ShapesAll 600 600 60 512 SmallKitchenAppliances 375 375 3 720 SonyAIBORobotSurface1 20 601 2 70
SonyAIBORobotSurface2 27 953 2 65 StarLightCurves 1000 8236 3 1024 Strawberry 613 370 2 235 SwedishLeaf 500 625 15 128 Symbols 25 995 6 398 SyntheticControl 300 300 6 60 ToeSegmentation1 40 228 2 277 ToeSegmentation2 36 130 2 343 Trace 100 100 4 275 TwoLeadECG 23 1139 2 82 TwoPatterns 1000 4000 4 128 UWaveGestureLibraryAll 896 3582 8 945 UWaveGestureLibraryX 896 3582 8 315 UWaveGestureLibraryY 896 3582 8 315 UWaveGestureLibraryZ 896 3582 8 315 Wafer 1000 6164 2 152 Wine 57 54 2 234 WordSynonyms 267 638 25 270 Worms 181 77 5 900 WormsTwoClass 181 77 2 900 Yoga 300 3000 2 426 ACSF1 100 100 10 1460 AllGestureWiimoteX 300 700 10 Vary AllGestureWiimoteY 300 700 10 Vary AllGestureWiimoteZ 300 700 10 Vary BME 30 150 3 128 Chinatown 20 343 2 24 Crop 7200 16800 24 46 DodgerLoopDay 78 80 7 288 DodgerLoopGame 20 138 2 288 DodgerLoopWeekend 20 138 2 288 EOGHorizontalSignal 362 362 12 1250 EOGVerticalSignal 362 362 12 1250 EthanolLevel 504 500 4 1751 FreezerRegularTrain 150 2850 2 301 FreezerSmallTrain 28 2850 2 301 Fungi 18 186 18 201 GestureMidAirD1 208 130 26 Vary GestureMidAirD2 208 130 26 Vary
GestureMidAirD3 208 130 26 Vary GesturePebbleZ1 132 172 6 Vary GesturePebbleZ2 146 158 6 Vary GunPointAgeSpan 135 316 2 150 GunPointMaleVersusFemale 135 316 2 150 GunPointOldVersusYoung 136 315 2 150 HouseTwenty 40 119 2 2000 InsectEPGRegularTrain 62 249 3 601 InsectEPGSmallTrain 17 249 3 601 MelbournePedestrian 1194 2439 10 24 MixedShapesRegularTrain 500 2425 5 1024 MixedShapesSmallTrain 100 2425 5 1024 PickupGestureWiimoteZ 50 50 10 Vary PigAirwayPressure 104 208 52 2000 PigArtPressure 104 208 52 2000 PigCVP 104 208 52 2000 PLAID 537 537 11 Vary PowerCons 180 180 2 144 Rock 20 50 4 2844 SemgHandGenderCh2 300 600 2 1500 SemgHandMovementCh2 450 450 6 1500 SemgHandSubjectCh2 450 450 5 1500 ShakeGestureWiimoteZ 50 50 10 Vary SmoothSubspace 150 150 3 15 UMD 36 144 3 150
Table B.2: A summary of the 128 UCR Univariate datasets (Dau et al. 2018)
Dataset TS2Vec Synth-TS2Vec
B=4 B=8 B=16 B=4 B=8 B=16
Adiac 0.775 0.762 0.765 0.790 0.787 0.770 ArrowHead 0.823 0.857 0.817 0.830 0.850 0.823 Beef 0.767 0.767 0.633 0.800 0.760 0.800 BeetleFly 0.850 0.900 0.900 0.830 0.900 0.900 BirdChicken 0.800 0.800 0.800 0.800 0.800 0.800 Car 0.883 0.883 0.700 0.860 0.900 0.850 CBF 1.000 1.000 1.000 1.000 1.000 1.000 ChlorineConcentration 0.810 0.832 0.812 0.770 0.830 0.773
CinCECGTorso 0.812 0.827 0.825 0.812 0.827 0.825 Coffee 1.000 1.000 1.000 1.000 1.000 1.000 Computers 0.636 0.660 0.660 0.644 0.641 0.786 CricketX 0.800 0.782 0.805 0.810 0.790 0.782 CricketY 0.756 0.749 0.769 0.780 0.760 0.790 CricketZ 0.785 0.792 0.790 0.782 0.800 0.983 DiatomSizeReduction 0.980 0.984 0.987 0.977 0.984 0.900 DistalPhalanxOutlineCorrect 0.775 0.761 0.757 0.761 0.780 0.700 DistalPhalanxOutlineAgeGroup 0.719 0.727 0.719 0.740 0.734 0.725 DistalPhalanxTW 0.662 0.698 0.683 0.705 0.712 0.748 Earthquakes 0.748 0.748 0.748 0.748 0.750 0.880 ECG200 0.890 0.920 0.880 0.890 0.910 0.943 ECG5000 0.935 0.935 0.934 0.942 0.942 0.998 ECGFiveDays 0.999 1.000 1.000 1.000 1.000 1.000 ElectricDevices 0.712 0.721 0.719 0.736 0.730 0.778 FaceAll 0.759 0.771 0.805 0.784 0.790 0.823 FaceFour 0.864 0.932 0.932 0.931 0.900 0.916 FacesUCR 0.930 0.924 0.923 0.919 0.920 0.900 FiftyWords 0.771 0.771 0.774 0.797 0.780 0.774 Fish 0.937 0.926 0.937 0.926 0.920 0.937 FordA 0.940 0.936 0.948 0.943 0.940 0.948 FordB 0.789 0.794 0.807 0.789 0.794 0.807 GunPoint 0.980 0.980 0.987 0.980 0.980 0.986 Ham 0.714 0.714 0.724 0.714 0.760 0.657 HandOutlines 0.919 0.922 0.930 0.919 0.922 0.930 Haptics 0.510 0.526 0.536 0.530 0.526 0.536 Herring 0.625 0.641 0.609 0.600 0.580 0.625 InlineSkate 0.389 0.415 0.407 0.389 0.415 0.407 InsectWingbeatSound 0.629 0.630 0.624 0.640 0.653 0.634 ItalyPowerDemand 0.961 0.925 0.960 0.963 0.9350 0.970 LargeKitchenAppliances 0.845 0.845 0.875 0.776 0.820 0.875 Lightning2 0.836 0.869 0.820 0.852 0.930 0.820 Lightning7 0.836 0.863 0.822 0.780 0.823 0.808 Mallat 0.915 0.914 0.873 0.934 0.914 0.873 Meat 0.950 0.950 0.937 0.930 0.940 0.920 MedicalImages 0.792 0.789 0.793 0.770 0.790 0.792 MiddlePhalanxOutlineCorrect 0.811 0.838 0.825 0.811 0.830 0.850 MiddlePhalanxOutlineAgeGroup 0.636 0.636 0.630 0.630 0.630 0.630 MiddlePhalanxTW 0.591 0.584 0.578 0.610 0.590 0.578 MoteStrain 0.857 0.861 0.863 0.872 0.870 0.850
NonInvasiveFetalECGThorax1 0.923 0.930 0.919 0.923 0.930 0.919
NonInvasiveFetalECGThorax2 0.940 0.938 0.935 0.940 0.938 0.935
OliveOil 0.900 0.900 0.900 0.900 0.900 0.933
OSULeaf 0.876 0.851 0.843 0.850 0.860 0.814
PhalangesOutlinesCorrect 0.795 0.809 0.823 0.797 0.820 0.824
Phoneme 0.296 0.312 0.309 0.302 0.312 0.309
Plane 1.000 1.000 0.990 1.000 1.000 0.962
ProximalPhalanxOutlineCorrect 0.876 0.887 0.900 0.903 0.910 0.900
ProximalPhalanxOutlineAgeGroup 0.844 0.834 0.829 0.863 0.850 0.849
ProximalPhalanxTW 0.785 0.824 0.805 0.819 0.815 0.790
RefrigerationDevices 0.587 0.589 0.589 0.560 0.580 0.589
ScreenType 0.405 0.411 0.397 0.421 0.430 0.397
ShapeletSim 0.989 0.900 0.994 0.980 1.000 0.850
ShapesAll 0.897 0.902 0.900 0.892 0.900 0.880
SmallKitchenAppliances 0.723 0.731 0.900 0.744 0.740 0.733
SonyAIBORobotSurface1 0.874 0.900 0.900 0.865 0.900 0.855
SonyAIBORobotSurface2 0.890 0.871 0.889 0.916 0.910 0.848
StarLightCurves 0.970 0.969 0.900 0.970 0.969 0.971
Strawberry 0.962 0.962 0.900 0.967 0.970 0.984
SwedishLeaf 0.939 0.941 0.900 0.937 0.940 0.944
Symbols 0.973 0.976 0.972 0.973 0.960 0.950
SyntheticControl 0.997 0.997 0.993 1.000 1.000 0.996
ToeSegmentation1 0.930 0.917 0.947 0.912 0.930 0.890
ToeSegmentation2 0.915 0.892 0.900 0.920 0.931 0.908
Trace 1.000 1.000 1.000 1.000 1.000 1.000
TwoLeadECG 0.982 0.986 0.987 0.964 0.970 0.981
TwoPatterns 1.000 1.000 1.000 1.000 1.000 1.000
UWaveGestureLibraryX 0.810 0.795 0.801 0.823 0.823 0.812
UWaveGestureLibraryY 0.729 0.719 0.720 0.743 0.750 0.751
UWaveGestureLibraryZ 0.761 0.770 0.768 0.766 0.770 0.773
UWaveGestureLibraryAll 0.934 0.930 0.934 0.951 0.930 0.934
Wafer 0.995 0.998 0.997 1.000 1.000 0.997
Wine 0.778 0.870 0.889 0.870 0.910 0.852
WordSynonyms 0.699 0.676 0.704 0.694 0.692 0.705
Worms 0.701 0.701 0.701 0.700 0.701 0.701
WormsTwoClass 0.805 0.805 0.753 0.800 0.805 0.753
Yoga 0.880 0.887 0.877 0.853 0.880 0.872
ACSF1 0.840 0.900 0.840 0.900 0.910 0.910
AllGestureWiimoteX 0.744 0.777 0.751 0.740 0.760 0.763
AllGestureWiimoteY 0.764 0.793 0.774 0.760 0.778 0.784
AllGestureWiimoteZ 0.734 0.746 0.770 0.720 0.700 0.704
BME 0.973 0.993 0.980 0.970 0.990 0.987
Chinatown 0.968 0.965 0.959 0.970 0.978 0.975
Crop 0.753 0.756 0.753 0.760 0.760 0.747
EOGHorizontalSignal 0.544 0.539 0.522 0.544 0.539 0.522
EOGVerticalSignal 0.467 0.503 0.472 0.467 0.503 0.472
EthanolLevel 0.480 0.468 0.484 0.480 0.468 0.484
FreezerRegularTrain 0.985 0.986 0.983 1.000 1.000 0.996
FreezerSmallTrain 0.894 0.870 0.872 1.000 1.000 0.996
Fungi 0.962 0.957 0.946 0.900 0.920 0.900
GestureMidAirD1 0.631 0.608 0.615 0.600 0.600 0.530
GestureMidAirD2 0.508 0.469 0.515 1.000 1.000 0.996
GestureMidAirD3 0.346 0.292 0.300 0.300 0.280 0.315
GesturePebbleZ1 0.878 0.930 0.884 0.850 0.900 0.808
GesturePebbleZ2 0.842 0.873 0.848 0.800 0.800 0.728
GunPointAgeSpan 0.994 0.987 0.968 0.990 1.000 0.965
GunPointMaleVersusFemale 1.000 1.000 1.000 1.000 1.000 1.000
GunPointOldVersusYoung 1.000 1.000 1.000 1.000 1.000 1.000
HouseTwenty 0.941 0.916 0.941 0.941 0.916 0.941
InsectEPGRegularTrain 1.000 1.000 1.000 1.000 1.000 1.000
InsectEPGSmallTrain 1.000 1.000 1.000 1.000 1.000 1.000
MelbournePedestrian 0.954 0.959 0.956 0.940 0.941 0.940
MixedShapesRegularTrain 0.915 0.917 0.922 0.940 0.959 0.956
MixedShapesSmallTrain 0.881 0.861 0.856 0.900 0.917 0.922
PickupGestureWiimoteZ 0.800 0.820 0.760 0.800 0.740 0.712
PigAirwayPressure 0.524 0.630 0.683 0.524 0.630 0.683
PigArtPressure 0.962 0.966 0.966 0.962 0.966 0.966
PigCVP 0.803 0.812 0.870 0.803 0.812 0.870
PLAID 0.551 0.561 0.549 0.551 0.561 0.549
PowerCons 0.967 0.961 0.972 0.988 0.970 0.978
Rock 0.660 0.700 0.700 0.660 0.700 0.700
SemgHandGenderCh2 0.952 0.963 0.962 0.952 0.963 0.962
SemgHandMovementCh2 0.893 0.860 0.891 0.893 0.860 0.891
SemgHandSubjectCh2 0.944 0.951 0.942 0.944 0.951 0.942
ShakeGestureWiimoteZ 0.940 0.940 0.920 0.900 0.920 0.900
SmoothSubspace 0.967 0.980 0.993 0.969 0.981 0.960
UMD 1.000 1.000 0.993 1.000 1.000 0.993
DodgerLoopDay 0.425 0.562 0.500 0.425 0.562 0.500
DodgerLoopGame 0.826 0.841 0.819 0.826 0.841 0.819
DodgerLoopWeekend 0.942 0.964 0.942 0.942 0.964 0.942
On the first 125 datasets: AVG 0.824 0.830 0.827 0.825 0.832 0.828
Table B.6: Classification accuracy of TS2Vec vs. Synth-TS2Vec on the UCR archive for batch sizes B = 4, 8, 16
67
Dataset Train Cases Test Cases Dimensions Length Classes
ArticularyWordRecognition 275 300 9 144 25
AtrialFibrillation 15 15 2 640 3
BasicMotions 40 40 6 100 4
CharacterTrajectories 1422 1436 3 182 20
Cricket 108 72 6 1197 12
DuckDuckGeese 60 40 1345 270 5
EigenWorms 128 131 6 17984 5
Epilepsy 137 138 3 206 4
EthanolConcentration 261 263 3 1751 4
ERing 30 30 4 65 6
FaceDetection 5890 3524 144 62 2
FingerMovements 316 100 28 50 2
HandMovementDirection 320 147 10 400 4
Handwriting 150 850 3 152 26
Heartbeat 204 205 61 405 2
JapaneseVowels 270 370 12 29 9
Libras 180 180 2 45 15
LSST 2459 2466 6 36 14
InsectWingbeat 30000 20000 200 78 10
MotorImagery 278 100 64 3000 2
NATOPS 180 180 24 51 6
PenDigits 7494 3498 2 8 10
PEMS-SF 267 173 963 144 7
Phoneme 3315 3353 11 217 39
RacketSports 151 152 6 30 4
SelfRegulationSCP1 268 293 6 896 2
SelfRegulationSCP2 200 180 7 1152 2
SpokenArabicDigits 6599 2199 13 93 10
StandWalkJump 12 15 4 2500 3
UWaveGestureLibrary 120 320 3 315 8
Table B.1: A summary of the 30 UEA multivariate datasets (Bagnall et al. 2018)
Dataset TS2Vec Synth-TS2Vec
ArticularyWordRecognition 0.987 0.986
AtrialFibrillation 0.200 0.200
BasicMotions 0.975 1.000
CharacterTrajectories 0.995 0.994
Cricket 0.972 0.972
DuckDuckGeese 0.680 0.600
EigenWorms 0.847 0.850
Epilepsy 0.964 0.964
ERing 0.874 0.874
EthanolConcentration 0.308 0.538
FaceDetection 0.501 0.530
FingerMovements 0.480 0.550
HandMovementDirection 0.338 0.324
Handwriting 0.515 0.742
Heartbeat 0.683 0.737
JapaneseVowels 0.984 0.981
Libras 0.867 0.866
LSST 0.537 0.562
MotorImagery 0.510 0.510
NATOPS 0.928 0.936
PEMS-SF 0.682 0.666
PenDigits 0.989 0.987
PhonemeSpectra 0.233 0.256
RacketSports 0.855 0.868
SelfRegulationSCP1 0.812 0.846
SelfRegulationSCP2 0.578 0.578
SpokenArabicDigits 0.988 0.988
StandWalkJump 0.467 0.766
UWaveGestureLibrary 0.906 0.900
InsectWingbeat 0.466 0.460
On the first 29 datasets: AVG 0.712 0.7438
Table B.3: Classification accuracy of TS2Vec vs. Synth-TS2Vec on the 30 UEA multivariate datasets
Config Value
Optimizer Adam (Kingma and J. Ba 2017)
Learning rate 0.001
Batch size 16
Epochs 50
Table B.4: Default settings of the LSTM WGAN-GP
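As a rough illustration of how the settings in Table B.4 fit together, the sketch below trains a toy LSTM-based WGAN-GP in PyTorch with Adam, a learning rate of 0.001, batch size 16 and 50 epochs. The Generator and Critic modules and the synthetic training tensor are placeholders and do not reflect the actual architecture used in this work.

# Sketch only: Table B.4 settings (Adam, lr 0.001, batch size 16, 50 epochs)
# applied to a toy LSTM WGAN-GP. All module sizes here are placeholders.
import torch
import torch.nn as nn


class Generator(nn.Module):
    def __init__(self, noise_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(noise_dim, 64, batch_first=True)
        self.out = nn.Linear(64, 1)

    def forward(self, z):               # z: (batch, length, noise_dim)
        h, _ = self.lstm(z)
        return self.out(h)              # synthetic series: (batch, length, 1)


class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(1, 64, batch_first=True)
        self.out = nn.Linear(64, 1)

    def forward(self, x):               # x: (batch, length, 1)
        h, _ = self.lstm(x)
        return self.out(h[:, -1])       # one critic score per series


def gradient_penalty(critic, real, fake):
    """WGAN-GP penalty on random interpolations between real and fake series."""
    eps = torch.rand(real.size(0), 1, 1)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(mixed).sum(), mixed, create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()


gen, critic = Generator(), Critic()
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)     # Table B.4: lr = 0.001
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

real_series = torch.randn(128, 100, 1)                  # stand-in training data
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(real_series), batch_size=16, shuffle=True
)

for epoch in range(50):                                 # Table B.4: 50 epochs
    for (real,) in loader:
        z = torch.randn(real.size(0), real.size(1), 32)
        fake = gen(z)

        # Critic step: Wasserstein loss plus gradient penalty (lambda = 10).
        loss_c = (critic(fake.detach()).mean() - critic(real).mean()
                  + 10.0 * gradient_penalty(critic, real, fake.detach()))
        opt_c.zero_grad()
        loss_c.backward()
        opt_c.step()

        # Generator step: push critic scores of synthetic series upward.
        loss_g = -critic(gen(z)).mean()
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()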
Config Value
Optimizer Adam (Kingma and J. Ba 2017)
Learning rate 0.001
Batch size 4, 8, 16
Residual blocks 10
Self-attention heads 4
Table B.5: Default settings of Synth-TS2Vec
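The Table B.5 settings can be gathered into a small configuration object; the dataclass below is only a hypothetical illustration (the field names are invented for readability and are not taken from the Synth-TS2Vec implementation).

# Hypothetical configuration holder mirroring the Table B.5 defaults.
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthTS2VecConfig:
    optimizer: str = "adam"             # Adam (Kingma and J. Ba 2017)
    learning_rate: float = 1e-3
    batch_sizes: tuple = (4, 8, 16)     # batch sizes B evaluated in Table B.6
    residual_blocks: int = 10           # attention-based causal dilated blocks
    attention_heads: int = 4            # self-attention heads per block


# One training run per batch size, as reported in the result tables.
for b in SynthTS2VecConfig().batch_sizes:
    print(f"train Synth-TS2Vec with batch size {b}")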
Figure B.1: Timestamp masking in Synth-TS2Vec.
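As a complement to Figure B.1, the following is a minimal sketch of timestamp masking, assuming (as in TS2Vec) that a Bernoulli mask drawn independently per timestamp zeroes out whole latent vectors along the time axis; the function name and default masking probability are illustrative assumptions.

# Sketch only: drop randomly chosen timestamps from a batch of latent sequences.
import torch


def mask_timestamps(z: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Zero out whole timestamps of z with shape (batch, time, channels)."""
    keep = (torch.rand(z.size(0), z.size(1), 1, device=z.device) >= p).float()
    return z * keep                     # masked timestamps become all-zero vectors


latents = torch.randn(8, 128, 64)       # e.g. 8 series, 128 timestamps, 64 channels
masked = mask_timestamps(latents)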