Introduction
The proliferation of digital commerce platforms over the past two decades has generated an unprecedented volume of user-generated textual content in the form of product reviews, ratings, and evaluative commentary. As consumers increasingly rely upon the opinions of prior purchasers when making purchasing decisions, and as businesses seek scalable methods for monitoring customer satisfaction and product reputation across large review repositories, the automated analysis of sentiment expressed in natural language text has emerged as one of the most practically consequential applications of computational linguistics and machine learning. The scale of this phenomenon is particularly pronounced in markets where a dominant domestic e-commerce platform concentrates a large proportion of consumer activity: the Polish market, in which the Allegro marketplace handles tens of millions of transactions annually, exemplifies this pattern and generates review corpora of sufficient scale to support rigorous empirical evaluation of competing natural language processing methodologies.
Within the broader landscape of natural language processing research, the task of sentiment classification — the automated assignment of polarity labels such as positive, negative, or neutral to natural language texts — has served as a canonical evaluation benchmark for successive generations of machine learning architectures. Recurrent neural networks, and Long Short-Term Memory architectures in particular, represented the dominant paradigm for sequence modelling tasks throughout the mid-2010s, offering a principled mechanism for capturing temporal dependencies across variable-length token sequences while mitigating the vanishing gradient pathologies that had limited earlier recurrent architectures. The introduction of the Transformer architecture, founded upon the self-attention mechanism, subsequently displaced recurrent models as the state-of-the-art approach across a wide range of natural language processing benchmarks, and the subsequent development of large-scale pre-trained language models based upon the Transformer encoder — including BERT and its numerous language-specific derivatives — established a new paradigm of transfer learning in which general-purpose contextual representations trained on large unlabelled corpora are fine-tuned for specific downstream tasks.
The transition from recurrent to Transformer-based architectures has been extensively studied in the context of high-resource languages, particularly English, where large pre-trained models have consistently demonstrated substantial performance advantages over LSTM-based baselines across sentiment analysis, question answering, named entity recognition, and a range of other natural language processing tasks. The situation is considerably less well characterised for morphologically complex languages such as Polish, in which the large surface-form vocabulary generated by rich inflectional morphology poses distinct challenges for both tokenisation design and embedding generalisation that may modulate the relative advantage of Transformer-based models. Although Polish-adapted pre-trained language models, including HerBERT and Polish BERT, have been developed and evaluated on general-purpose natural language processing benchmarks, a systematic and methodologically rigorous direct comparison of Transformer-based and LSTM-based architectures specifically in the domain of Polish-language product review sentiment classification has not, to the best of the present author's knowledge, been reported in the published literature. This gap motivates the investigation undertaken in the present thesis.
The central research objective of this work is to conduct a controlled comparative evaluation of LSTM-based and Transformer-based neural architectures applied to the task of three-class sentiment classification of Polish-language product reviews, with the aim of producing statistically defensible conclusions about the relative merits of the two architectural paradigms in this specific linguistic and domain context. The primary corpus employed for this evaluation is the Allegro Reviews Dataset, the largest publicly available Polish-language product review collection, which is supplemented by the PolEmo 2.0 dataset to assess cross-domain generalisation. The LSTM-based models evaluated include unidirectional and bidirectional configurations, with and without pre-trained fastText embeddings, providing a set of baselines that isolate the contributions of architectural complexity and transfer learning independently. The Transformer-based models evaluated are HerBERT and Polish BERT, both of which were pre-trained on large Polish-language corpora and fine-tuned on the target task using standard procedures from the transfer learning literature.
The research hypothesis guiding the experimental design is that Transformer-based models, by virtue of their bidirectional contextual representations, large Polish pre-training corpora, and Polish-specific subword vocabularies, will demonstrate statistically significant superiority over LSTM-based baselines in sentiment classification of Polish product reviews. This hypothesis is motivated by the theoretical analysis of the two architectural families presented in Chapter 1, which identifies the capacity of self-attention mechanisms to integrate information across arbitrarily long spans without sequential compression as the principal architectural advantage of Transformer models, and by the documented effectiveness of Polish pre-trained models on related downstream tasks. At the same time, the hypothesis is qualified by the recognition that LSTM-based models, as computationally lighter architectures whose inference costs are substantially lower than those of large Transformer models, may represent a practically preferable option in resource-constrained deployment scenarios where performance differences are modest in absolute terms.
The evaluation framework employed in this study was designed to ensure that the empirical conclusions are robust to methodological choices and to the class imbalance characteristic of naturally occurring product review distributions. The primary evaluation metric is macro-averaged F1-score, which treats all three sentiment classes — positive, negative, and neutral — as equally important regardless of their relative frequency in the test set. This choice is supplemented by per-class precision and recall as diagnostic metrics and by macro-averaged area under the receiver operating characteristic curve as a threshold-independent complement. Uncertainty in all reported metrics is quantified through bootstrap confidence intervals, and pairwise differences between model configurations are assessed for statistical significance using McNemar's test with Holm–Bonferroni correction for multiple comparisons. This combination of evaluation procedures reflects established best practices for classifier comparison under imbalanced and multi-class conditions.
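The statistical procedures named above can be illustrated with a minimal NumPy sketch. It is offered as an illustration under stated assumptions and not as the experimental code of this thesis: the function names, the percentile form of the bootstrap, the exact (binomial) variant of McNemar's test, and the default of 1,000 resamples are all assumptions introduced here.

```python
import numpy as np
from math import comb

def macro_f1(y_true, y_pred, n_classes=3):
    """Unweighted mean of per-class F1, so rare classes count as much as frequent ones."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # A class absent from both labels and predictions scores 0 here (assumption).
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for macro-F1, resampling test items with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)
        stats.append(macro_f1(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value computed from the discordant pairs only."""
    a_only = int(np.sum((pred_a == y_true) & (pred_b != y_true)))
    b_only = int(np.sum((pred_a != y_true) & (pred_b == y_true)))
    n = a_only + b_only
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def holm(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    adjusted = np.empty(m)
    running = 0.0
    for rank, i in enumerate(np.argsort(p)):
        running = max(running, (m - rank) * p[i])
        adjusted[i] = min(1.0, running)
    return adjusted
```

In this sketch the inputs are integer-coded NumPy arrays of equal length. A pairwise comparison would call mcnemar_exact once per pair of models and pass the collected p-values to holm before applying the significance threshold.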
The present thesis is structured as follows. Chapter 1 provides the theoretical foundations necessary for interpreting the experimental results. It introduces the field of sentiment analysis, its principal tasks and applications, and the specific challenges posed by Polish as a morphologically complex language for sentiment classification systems. The chapter then surveys the development of LSTM-based sequence modelling architectures, discussing the LSTM gating mechanism, the role of pre-trained word embeddings in mitigating data scarcity, and the documented limitations of sequential compression for capturing long-range dependencies. The chapter proceeds to examine the Transformer architecture and the self-attention mechanism, reviews the pre-training and fine-tuning paradigm established by BERT, and describes the Polish-adapted pre-trained models (HerBERT and Polish BERT) selected for evaluation in this study. The chapter concludes by situating the present research question within the landscape of existing comparative studies and identifying the specific gap that the present thesis addresses.
Chapter 2 documents the dataset construction, preprocessing procedures, and experimental methodology in sufficient detail to permit independent replication of the reported results. It describes the Allegro Reviews Dataset and the PolEmo 2.0 corpus, their statistical characteristics, and the procedures employed to construct training, validation, and test partitions with balanced class representation. The preprocessing pipeline, including tokenisation, lemmatisation for LSTM models, and model-specific subword tokenisation for Transformer models, is described in full, as are the hyperparameter selection procedures, training schedules, and regularisation configurations applied to each model family. The evaluation framework and statistical analysis procedures are introduced and justified with reference to established methodological literature.
Chapter 3 presents and interprets the empirical results of the comparative evaluation. It reports the classification performance of LSTM configurations on the Allegro test set, documenting the contributions of bidirectionality, stacking, and pre-trained embeddings to overall macro-F1 performance. It then reports the performance of fine-tuned HerBERT and Polish BERT models, analyses the statistical significance of observed performance differences relative to the best LSTM baseline, and evaluates cross-domain generalisation on the PolEmo 2.0 benchmark. The chapter concludes with a detailed error analysis that examines the types of misclassifications produced by each model family, identifying the linguistic phenomena (including negation scope resolution, long-range contextual dependencies, and domain-specific technical vocabulary) that most sharply differentiate Transformer and recurrent representations in this domain. The Conclusion synthesises the principal findings of the study, assesses the degree to which the original research hypothesis has been confirmed, reflects upon the practical implications of the performance-cost trade-off between the two architectural families, and identifies the most productive directions for future work extending the present investigation.
The contributions of this thesis are threefold. First, it provides the first systematic, statistically rigorous direct comparison of LSTM-based and Transformer-based models for three-class sentiment classification of Polish-language product reviews, filling a documented gap in the published literature on Polish natural language processing. Second, it establishes a methodologically transparent evaluation framework — including bootstrap confidence intervals, McNemar tests with multiple comparison correction, and cross-domain generalisation assessment — that may serve as a reference design for future comparative studies in this domain. Third, its detailed error analysis moves beyond aggregate accuracy comparisons to identify the specific linguistic phenomena driving performance differences between architectural families, providing a principled mechanistic basis for understanding when and why each architecture succeeds or fails on Polish-language sentiment classification tasks. Together, these contributions advance the understanding of neural sentiment analysis for morphologically complex languages and provide a foundation upon which subsequent research in aspect-level analysis, cross-domain transfer, and parameter-efficient fine-tuning may build.
Chapter 1: Theoretical Foundations of Sentiment Analysis and Neural Language Models
1.1. The Nature and Scope of Sentiment Analysis
Sentiment analysis, also referred to in the literature as opinion mining, constitutes a subdiscipline of natural language processing (NLP) concerned with the computational identification, extraction, and quantification of subjective information from natural language text [1]. At its most fundamental level, the field addresses the question of how positive, negative, or neutral orientations expressed by human authors can be reliably detected and classified by automated systems operating over large corpora of written language. The distinction between sentiment analysis and adjacent NLP tasks such as named entity recognition, information extraction, and syntactic parsing lies principally in its focus on affective and evaluative meaning rather than factual content: whereas an information extraction system seeks to identify who did what to whom, a sentiment analysis system seeks to determine whether a given author approves or disapproves, and to what degree, of the entities or events described in the text [3].
The field is organized around a taxonomy of tasks that differ in the granularity of analysis and the nature of the output produced. At the coarsest level, document-level polarity detection assigns a single sentiment label — commonly drawn from the set {positive, negative, neutral} — to an entire document, treating it as the expression of a unified opinion [2]. This approach is appropriate for relatively short, homogeneous texts such as product reviews, where a single purchase occasion motivates the review and the author's global sentiment is the primary signal of interest. At a finer level of granularity, sentence-level sentiment analysis decomposes the input into individual propositions, recognizing that a single review may contain sentences of mixed polarity: a reviewer may praise the product's durability while criticizing its design. The most granular level of analysis is provided by aspect-based sentiment analysis (ABSA), which decomposes opinion into a set of (target entity, attribute, sentiment polarity) triples, enabling systems to distinguish between, for example, a positive evaluation of a smartphone's camera and a negative evaluation of its battery life within a single text [1].
Emotion recognition is treated in the literature as a distinct but closely related subfield, mapping text onto categorical emotion models rather than simple polarity labels [2]. Two primary representational frameworks are employed: categorical models, such as Ekman's taxonomy of six basic emotions (happiness, sadness, anger, fear, disgust, and surprise), and dimensional models operating in continuous valence-arousal-dominance (VAD) spaces, where valence encodes pleasantness, arousal encodes activation level, and dominance encodes the degree of control experienced. Sentiment polarity can be understood as a projection of this higher-dimensional emotional space onto a single axis, which explains why sentiment classifiers trained on polarity labels may confuse semantically distinct emotions that happen to share a positive or negative valence. Within practical NLP pipelines, a subjectivity detection step is frequently inserted prior to polarity classification: the system first determines whether a sentence expresses an opinion at all, filtering out purely factual statements before applying the more computationally expensive polarity classifier [3].
The field of sentiment analysis has found application across a wide range of practical domains, each of which imposes its own constraints on the design of the analysis pipeline. In e-commerce and product review analysis — the domain of primary relevance to the present thesis — sentiment analysis enables retailers and manufacturers to aggregate customer feedback at scale, identify product defects surfaced in reviews, and monitor competitive positioning [1]. In social media monitoring, the challenge is compounded by the brevity, informality, and high noise level of microblog text such as posts on platform X. Financial sentiment analysis seeks to predict market movements from the aggregate tone of news articles and analyst reports, where even small shifts in sentiment distribution carry economic significance [3]. In each domain, the central computational challenges remain consistent: handling linguistic negation (which inverts polarity), detecting sarcasm and irony (where the surface polarity of the words employed is opposite to the intended communicative meaning), coping with domain-specific evaluative lexica (where words that are positive in one domain may be neutral or negative in another), recognizing implicit sentiment (where no explicit evaluative term appears but the described state of affairs carries a clear valence), and generalizing across domains and registers [2]. These challenges motivate the use of deep learning architectures capable of learning rich contextual representations, rather than hand-crafted lexicon-based approaches that cannot generalize beyond their enumerated vocabulary.
The practical significance of robust sentiment analysis tools for Polish-language content has been underscored by several recent studies. Research comparing commercial sentiment analysis services on Polish and English text has demonstrated that Polish consistently poses greater difficulties for automated systems, with accuracy metrics for Polish generally lagging behind their English counterparts across multiple content types and evaluation settings [3]. This gap is attributable to a combination of factors: the relative scarcity of large annotated Polish corpora compared to English, the morphological complexity of the language (discussed in detail in the following subchapter), and the historical predominance of English-language training data in commercially deployed NLP models. The construction of dedicated Polish sentiment corpora — including the MultiEmo dataset covering Polish sentiment across multiple domains [2] and specialized datasets of Polish book reviews [1] — represents a sustained effort by the Polish NLP research community to address these resource gaps and enable fair evaluation of Polish-specific model architectures.
1.2. Challenges of Polish-Language Text Processing
Polish is a member of the West Slavic branch of the Indo-European language family, closely related to Czech and Slovak. It is classified typologically as a fusional language, in contrast to the largely analytic typology of English, where grammatical relationships are primarily expressed through fixed word order and auxiliary words rather than morphological inflection [11]. This typological distinction has profound consequences for the design and performance of NLP systems, because the fundamental assumptions embedded in most off-the-shelf NLP tools (including tokenizers, lemmatizers, part-of-speech taggers, and word embedding models) were developed primarily for English and reflect the structural properties of analytic languages. The deployment of such tools on Polish text without language-specific adaptation produces systematically degraded results, motivating the development of dedicated Polish NLP resources and architectures [4].
The nominal morphology of Polish is the single most significant source of NLP complexity. Polish nouns, pronouns, adjectives, and numerals are inflected for grammatical case and number, with gender either inherent (nouns) or marked by agreement (adjectives, pronouns, numerals), yielding paradigms with a large number of distinct surface forms per underlying lemma. The case system comprises seven cases (nominative, genitive, dative, accusative, instrumental, locative, and vocative) applied across three grammatical genders (masculine, feminine, neuter) and two numbers (singular, plural), with additional sub-distinctions within the masculine gender (animate vs. inanimate, personal vs. non-personal). For a typical Polish noun, this system yields up to fourteen orthographically distinct surface forms (seven cases in each of two numbers), although syncretism between cases usually leaves fewer distinct spellings; for adjectives, which must agree with the head noun in case, gender, and number, the paradigm is even larger. The practical consequence for NLP is severe: whereas the word types observed in an English corpus of one million tokens correspond largely to distinct lemmas, the types observed in a Polish corpus of the same size include many more inflectional variants of the same lemma, which inflates the apparent vocabulary size while leaving many forms with low individual frequency, a problem known as data sparsity [3]. Bag-of-words models and term-frequency statistics, which underpin classical NLP approaches, are particularly severely degraded by this property because the same word in different grammatical roles appears as orthographically unrelated forms.
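The effect of the case system on surface-form counts can be made concrete with a small sketch. The noun chosen here, telefon ("phone", masculine inanimate), and the five-token toy corpus are illustrative assumptions and are not drawn from the corpora used in this thesis.

```python
from collections import Counter

# Paradigm of the illustrative noun "telefon": seven cases in two numbers.
CASES = ["nom", "gen", "dat", "acc", "ins", "loc", "voc"]
PARADIGM = {
    "sg": ["telefon", "telefonu", "telefonowi", "telefon",
           "telefonem", "telefonie", "telefonie"],
    "pl": ["telefony", "telefonów", "telefonom", "telefony",
           "telefonami", "telefonach", "telefony"],
}

# One cell per (number, case) combination: 7 x 2 = 14 paradigm cells.
cells = [(number, case, form)
         for number, forms in PARADIGM.items()
         for case, form in zip(CASES, forms)]

# Syncretism (for example nominative = accusative) reduces the distinct spellings.
distinct_forms = sorted({form for _, _, form in cells})

# Hypothetical toy corpus: a bag-of-words model sees four unrelated types,
# whereas a lemma-based count sees a single item occurring five times.
tokens = ["telefon", "telefonu", "telefonem", "telefony", "telefonu"]
surface_counts = Counter(tokens)
form_to_lemma = {form: "telefon" for form in distinct_forms}
lemma_counts = Counter(form_to_lemma[t] for t in tokens)

print(len(cells), len(distinct_forms))          # 14 cells, 10 distinct spellings
print(len(surface_counts), dict(lemma_counts))  # 4 surface types, one lemma
```

The fourteen paradigm cells collapse to ten distinct spellings for this noun, which is why the figure of fourteen is an upper bound rather than a typical count.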
Verbal morphology adds a further layer of complexity that has no direct parallel in English. Polish verbs are inflected not only for tense and mood, as in most Indo-European languages, but also for grammatical aspect, a category that distinguishes between perfective verbs (denoting completed, bounded actions) and imperfective verbs (denoting ongoing, habitual, or unbounded actions). Aspect interacts with tense, mood, person, number, and gender agreement in complex ways, resulting in verb paradigms of substantial size. Additionally, Polish marks grammatical gender agreement between verbs and their subjects in the past tense and in the compound future formed with the past participle, meaning that the surface form of the verb encodes information about the gender of the subject, a feature entirely absent from English verbal morphology. For sentiment analysis specifically, verbal aspect and tense can modulate the polarity of a review: a sentence asserting that a product worked (imperfective past, implying prolonged past functioning) differs in evaluative meaning from a sentence asserting that a product stopped working (perfective past, implying a completed event of failure) in ways that require accurate morphological disambiguation to process correctly [1].
The free word order of Polish constitutes a second major challenge for NLP pipelines calibrated on English. In English, the syntactic roles of subject, verb, and object are determined primarily by their position in the sentence: the noun phrase preceding the finite verb is interpreted as the subject, and the noun phrase following the verb is interpreted as the object. In Polish, by contrast, grammatical case endings unambiguously mark syntactic roles regardless of surface position, freeing word order to encode pragmatic distinctions such as topic-comment structure, contrastive focus, and information salience. As a result, all six permutations of subject, verb, and object are grammatically acceptable in Polish, and the choice among them reflects the speaker's communicative intent rather than grammatical necessity. This property poses particular difficulties for fixed-window n-gram language models and for models that rely on positional embeddings calibrated on the relatively constrained word orders of English: the same propositional content expressed in different word orders will receive different vector representations under positional encoding schemes designed for fixed syntax [4].
Tokenization, the initial step in any NLP pipeline, is more challenging for Polish than for English for several additional reasons. Polish makes productive use of clitics (unstressed function words that attach phonologically to adjacent words) and of hyphenated compound forms, both of which create ambiguities in the identification of token boundaries. Multiword expressions, including fixed phrases, idioms, and named entities, frequently consist of fully inflected component words that superficially resemble free combinations but function as semantically opaque units. Lemmatization, the recovery of the canonical dictionary form from an inflected surface form, requires access to a morphological analysis component; the principal tools developed for Polish include Morfeusz [30], a morphological analyser covering the full inflectional paradigm of Polish, and KRNNT, a recurrent neural network tagger trained on morphologically annotated corpora. Errors at the lemmatization stage compound downstream, because a word incorrectly lemmatized will fail to match correctly inflected forms in other contexts, propagating noise through the pipeline [2]. The HerBERT model [11] addresses the tokenization challenge at the architectural level through the adoption of a Byte Pair Encoding (BPE) vocabulary of 50,000 subword units trained specifically on Polish text, ensuring that common Polish inflected forms receive dedicated vocabulary entries rather than being fragmented into many subword pieces, a property that has been shown empirically to improve downstream task performance on morphologically rich languages.
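The principle behind a BPE vocabulary can be illustrated with a toy learner. The sketch below is a textbook-style simplification and does not reproduce the tokenizer shipped with HerBERT; the word frequencies, the end-of-word marker, and the tie-breaking rule are assumptions introduced for the example.

```python
from collections import Counter

END = "</w>"  # hypothetical end-of-word marker

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(word_freqs, num_merges):
    """Greedily learn merges: at each step, fuse the most frequent adjacent symbol pair."""
    vocab = {tuple(word) + (END,): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties are broken by the lexicographically largest pair (an arbitrary, deterministic choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {tuple(merge_pair(list(s), best)): f for s, f in vocab.items()}
    return merges

def segment(word, merges):
    """Segment a word, seen or unseen, by replaying the learned merges in order."""
    symbols = list(word) + [END]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Toy frequencies for four inflected forms of one lemma (illustrative values).
merges = learn_bpe({"telefon": 10, "telefonu": 6, "telefonem": 4, "telefony": 5}, 6)
print(segment("telefonu", merges))    # ['telefon', 'u', '</w>']
print(segment("telefonami", merges))  # ['telefon', 'a', 'm', 'i', '</w>']
```

After six merges the shared stem has become a single unit, so even the unseen form telefonami is split into the stem plus its ending rather than into ten characters. A vocabulary trained on Polish text at realistic scale extends the same idea to frequent stems and inflectional endings.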
The following list summarizes the principal linguistic properties of Polish that distinguish its NLP processing requirements from those of English and motivate language-specific model design choices:
- Rich nominal inflection: seven grammatical cases, three genders, two numbers, yielding up to fourteen surface forms per lemma and severe data sparsity in frequency-based representations.
- Verbal aspect: perfective–imperfective distinction interacting with tense and gender agreement, modulating the evaluative meaning of verbal expressions in ways relevant to sentiment analysis.
- Free word order: all permutations of subject, verb, and object are grammatically acceptable; word order encodes pragmatic rather than grammatical information, confounding positional models.
- Clitics and multiword expressions: unstressed clitics and semantically opaque fixed phrases create tokenization ambiguities not present in English.
- Morphological disambiguation: required as a pre-processing step; errors at this stage propagate through the entire pipeline.
- Subword vocabulary size: a BPE vocabulary adequate for Polish requires substantially more entries than an equivalent English vocabulary to avoid excessive fragmentation of inflected forms.
These properties collectively create an evaluation environment that discriminates sharply between architectures with different capacities for representing morphological variation and long-range contextual dependencies. As is argued in subsequent chapters, the bidirectional contextual representations produced by Transformer-based models, together with subword tokenization vocabularies specifically trained on Polish, provide theoretical advantages over LSTM architectures that process sequences one step at a time and rely on static pre-trained word embeddings for lexical representation.
1.3. Recurrent Neural Networks and the LSTM Architecture
Recurrent neural networks (RNNs) represent the foundational deep learning architecture for sequential data processing, and their development provides the essential theoretical context for understanding both the capabilities and the limitations of the LSTM models evaluated in the experimental portion of this thesis. A standard RNN is a parametric function that maps an input sequence $x_1, x_2, \dots, x_T$ to a sequence of hidden states $h_1, h_2, \dots, h_T$ through the recurrence relation $h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b_h)$, with the output at each time step computed as $y_t = W_{hy} \cdot h_t + b_y$. Weight sharing across time steps (the same matrices $W_{hh}$, $W_{xh}$, and $W_{hy}$ are applied at every position) enables the network to process sequences of arbitrary length with a fixed parameter budget, providing a significant advantage over fixed-window architectures such as convolutional neural networks when the relevant context is long or variable [9]. The hidden state $h_t$ serves as a compressed representation of the entire input history up to position $t$, and for document-level classification tasks, the final hidden state $h_T$ or a pooled aggregate of all hidden states is passed to a classification head to produce the output label.
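The recurrence can be written out directly. The following NumPy sketch is illustrative only; the zero initial hidden state and the function name are assumptions, since the text does not specify how the first hidden state is initialised.

```python
import numpy as np

def rnn_forward(xs, W_hh, W_xh, W_hy, b_h, b_y):
    """Apply the same weights at every time step of a (T, input_dim) sequence."""
    h = np.zeros(W_hh.shape[0])  # assumed initial state h_0 = 0
    hidden_states, outputs = [], []
    for x_t in xs:
        # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        hidden_states.append(h)
        # y_t = W_hy h_t + b_y
        outputs.append(W_hy @ h + b_y)
    return np.stack(hidden_states), np.stack(outputs)
```

Because the loop body does not depend on the sequence length, the parameter count is fixed regardless of how many tokens are processed. For document-level classification, the last row of the returned hidden states, or a pooled aggregate of all rows, would be passed to the classification head.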
The training of RNNs proceeds through backpropagation through time (BPTT), which unrolls the recurrent computation graph across all time steps and applies the chain rule of differentiation to compute gradients with respect to the shared weight matrices. The critical deficiency of this procedure is revealed by analysing the gradient flow through the recurrence: the gradient of the loss with respect to a hidden state at time step $t$ involves repeated multiplication by the recurrent weight matrix $W_{hh}$, and after $k$ time steps the gradient magnitude scales approximately as $\|W_{hh}\|^k$. When the largest singular value of $W_{hh}$ is less than one, this quantity decays exponentially with $k$, producing the vanishing gradient problem in which early time steps receive negligible gradient signal and the network fails to learn long-range dependencies [10]. Conversely, when the largest singular value exceeds one, gradients can explode, destabilizing training. The vanishing gradient problem is particularly severe for sentiment analysis, where a sentiment-bearing phrase may be separated by many tokens from the words that determine its interpretation: a reviewer who writes "Despite the numerous problems I encountered during setup, the product ultimately performed well" requires the model to carry the concessive marker "Despite" across the intervening negative-valence clause, so that the description of setup problems is subordinated to the positive evaluation in the final clause rather than misleading a short-memory model.
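The exponential behaviour described above can be checked numerically. The sketch below is a deliberately simplified, linearised illustration: it ignores the tanh derivative (which is at most one and can only shrink the product further) and uses a scaled orthogonal matrix so that every singular value equals the chosen scale. Both simplifications are assumptions made for clarity.

```python
import numpy as np

def random_orthogonal(n, seed=0):
    """Orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

def jacobian_product_norms(W_hh, steps):
    """Spectral norm of the product of k copies of W_hh transposed, for k = 1..steps."""
    product = np.eye(W_hh.shape[0])
    norms = []
    for _ in range(steps):
        product = W_hh.T @ product
        norms.append(np.linalg.norm(product, 2))
    return norms

Q = random_orthogonal(8)
vanishing = jacobian_product_norms(0.9 * Q, 50)  # largest singular value 0.9
exploding = jacobian_product_norms(1.1 * Q, 50)  # largest singular value 1.1
print(vanishing[-1], exploding[-1])
```

With these settings the norm after 50 steps equals 0.9 raised to the power 50 (roughly 0.005) in the first case and 1.1 raised to the power 50 (roughly 117) in the second. For a general, non-orthogonal matrix the singular-value condition gives a bound rather than an equality, so a largest singular value above one permits explosion without guaranteeing it.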
The Long Short-Term Memory (LSTM) architecture, introduced by Hochreiter and Schmidhuber in 1997 [31], addresses the vanishing gradient problem through a principled architectural modification that introduces a cell state $c_t$ as a secondary memory channel running parallel to the hidden state. The cell state is updated through additive rather than multiplicative operations, allowing gradients to flow through it with minimal attenuation over long sequences. Three learned gating mechanisms control the information flow into, through, and out of the cell state [9]. The forget gate, defined as $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$, produces a vector of values in (0, 1) determining what fraction of the prior cell state $c_{t-1}$ to retain; a value close to zero instructs the network to discard the corresponding memory dimension, while a value close to one instructs it to preserve the information. The input gate $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ and the candidate cell state $\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$ jointly encode new information to be written into the cell: the cell is updated as $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$. Finally, the output gate $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ controls what portion of the cell state is exposed as the hidden state $h_t = o_t \odot \tanh(c_t)$. The use of the sigmoid activation in all three gates ensures that each gate output lies in (0, 1), enabling soft, differentiable gating decisions that can be learned end-to-end by gradient descent [10].
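The gate equations translate into a single update step. The following NumPy sketch is a minimal illustration; the parameter dictionary layout, the initialisation scale, and the function names are assumptions introduced here rather than details of the models evaluated in this thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(input_dim, hidden_dim, seed=0):
    """Small random weights and zero biases for the four blocks f, i, c, o (assumed scheme)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in "fico":
        params[f"W_{gate}"] = rng.normal(0.0, 0.1, size=(hidden_dim, hidden_dim + input_dim))
        params[f"b_{gate}"] = np.zeros(hidden_dim)
    return params

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the gate equations in the text."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f = sigmoid(params["W_f"] @ z + params["b_f"])         # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])         # input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])   # candidate cell state
    o = sigmoid(params["W_o"] @ z + params["b_o"])         # output gate
    c = f * c_prev + i * c_tilde                           # additive cell update
    h = o * np.tanh(c)                                     # exposed hidden state
    return h, c
```

Setting the forget-gate bias to a large positive value and the input-gate bias to a large negative value drives the forget gate towards one and the input gate towards zero, so the cell state is carried forward almost unchanged. This is the memory-preserving behaviour that the additive update is designed to permit.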
An important extension of the LSTM architecture for text classification is the Bidirectional LSTM (BiLSTM), in which two independent LSTM networks are applied to the input sequence — one processing the sequence in the forward direction (left to right) and one in the reverse direction (right to left) [2]. The forward and backward hidden states at each position are concatenated to produce a combined representation that encodes context from both directions simultaneously. For sentiment analysis, bidirectionality is particularly valuable because the polarity of a token often depends on words that follow it as well as words that precede it: the modifier "not" reverses the polarity of the subsequent evaluative adjective, and aspect-based sentiment tasks require associating sentiment expressions with entity mentions that may appear in either direction. The MultiEmo study demonstrated that BiLSTM architectures combined with language-agnostic embeddings achieved strong cross-language sentiment classification performance [2], confirming the practical utility of bidirectionality for multilingual sentiment tasks. Regularization techniques applicable to LSTMs include standard dropout applied to input and output connections, recurrent dropout applied to the hidden-to-hidden transition, and layer normalization. For sentiment classification, the most common approach to producing a document-level representation from the sequence of hidden states is to take either the final hidden state of the forward LSTM, the concatenated final states of the forward and backward LSTMs in a BiLSTM, or a weighted attention pooling over all hidden states [5].
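The bidirectional construction and two of the pooling strategies mentioned above can be sketched as follows. This is an illustrative NumPy example and not the configuration evaluated in this thesis: the stacked weight layout, the dot-product scoring vector used for attention pooling, and all function names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(input_dim, hidden_dim, rng):
    """One stacked matrix holds the forget, input, candidate and output blocks."""
    return {"W": rng.normal(0.0, 0.1, size=(4 * hidden_dim, hidden_dim + input_dim)),
            "b": np.zeros(4 * hidden_dim)}

def run_lstm(xs, p):
    """Run an LSTM over a (T, input_dim) array and return all hidden states."""
    d = p["b"].size // 4
    h, c = np.zeros(d), np.zeros(d)
    states = []
    for x in xs:
        z = p["W"] @ np.concatenate([h, x]) + p["b"]
        f, i, o = sigmoid(z[:d]), sigmoid(z[d:2 * d]), sigmoid(z[3 * d:])
        c = f * c + i * np.tanh(z[2 * d:3 * d])
        h = o * np.tanh(c)
        states.append(h)
    return np.stack(states)

def bilstm(xs, fwd, bwd):
    """Concatenate forward states with backward states re-aligned to input order."""
    h_fwd = run_lstm(xs, fwd)
    h_bwd = run_lstm(xs[::-1], bwd)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)

def final_state_pooling(states, hidden_dim):
    """Forward final state sits at the last position, backward final state at the first."""
    return np.concatenate([states[-1, :hidden_dim], states[0, hidden_dim:]])

def attention_pooling(states, w):
    """Softmax-weighted average of all positions, scored by a learned vector w."""
    scores = states @ w
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ states, weights
```

The backward network reads the reversed sequence, so its final state summarises the whole input as seen from right to left and, after re-alignment, is stored at the first position. That is why final-state pooling takes the forward half from the last position and the backward half from the first.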
Despite their success in capturing sequential dependencies, LSTM architectures are subject to several fundamental limitations that have motivated the development of Transformer-based alternatives. First, the sequential nature of the recurrent computation prevents full parallelization over the time dimension during training: the hidden state at position t cannot be computed until the hidden state at position t−1 is available, so training on long sequences requires a number of sequential operations proportional to the sequence length. This makes LSTM training substantially slower on modern GPU hardware than Transformer training, where all positions can be processed in parallel [8]. Second, even with LSTM gating, the representation of a document is ultimately compressed into a fixed-size hidden state, creating an information bottleneck for very long texts where the entire document must be encoded into a vector of fixed dimensionality. Third, the maximum path length between two positions in an LSTM is O(n) for a sequence of length n, meaning that resolving long-range dependencies requires information to flow through many recurrent steps, each of which introduces the possibility of attenuation. These limitations are directly relevant to the sentiment classification of Polish product reviews, where reviews may be lengthy and the relevant contextual signal may be distributed across the entire text.
1.4. The Transformer Architecture and the Attention Mechanism
The Transformer architecture, introduced by Vaswani et al. in the landmark 2017 paper "Attention Is All You Need" [32], constitutes the foundational architectural paradigm of modern natural language processing and underlies all pre-trained language models evaluated in the experimental portion of this thesis. The core insight of the Transformer is that sequential recurrence, far from being necessary for modelling the long-range dependencies in text, is in fact a significant impediment to both parallelization and gradient flow, and can be entirely replaced by the attention mechanism — a form of weighted aggregation over all positions in the sequence simultaneously [7]. The historical development of attention mechanisms in neural networks predates the Transformer: Bahdanau et al. (2015) introduced an additive attention mechanism that augmented encoder-decoder recurrent neural networks for machine translation with a context vector computed as a weighted sum over encoder hidden states, the weights being determined by a learned compatibility function between the current decoder state and each encoder state [33]. The Transformer generalizes and radically extends this concept by making attention the sole mechanism for relating positions within the sequence, dispensing with recurrence entirely.
The fundamental operation of the Transformer is scaled dot-product attention, defined as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^{\top} / \sqrt{d_k})\,V$, where $Q$ (queries), $K$ (keys), and $V$ (values) are learned linear projections of the input into a common $d_k$-dimensional subspace [9]. For each query vector, the compatibility with all key vectors is computed as a dot product, and the resulting scores are divided by $\sqrt{d_k}$ to prevent the softmax function from operating in regions of very small gradients when the dimensionality is large. The softmax-normalized scores define a probability distribution over positions, and the output is computed as the corresponding weighted sum of value vectors. The essential property of self-attention, in which queries, keys, and values are all derived from the same input sequence, is that the representation of each position is computed as a direct, weighted function of all other positions, with the weights depending on learned compatibility between their respective content vectors. This gives the Transformer an $O(1)$ maximum path length between any two positions, in contrast to the $O(n)$ path length of RNNs, and means that long-range dependencies can in principle be captured within a single attention layer [10].
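The attention formula can be illustrated with a short NumPy sketch of single-head self-attention. The dimensions, the random projection matrices, and the function names below are illustrative assumptions; masking, batching, and multiple heads are deliberately left out.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) query-key compatibilities
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V, weights

# Self-attention: queries, keys, and values all come from the same input X.
rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

The `weights` matrix has one row per position and one column per attended position, which makes the quadratic cost in the sequence length directly visible.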
Multi-head attention extends scaled dot-product attention by running $h$ parallel attention heads, each with its own learned projection matrices $W_i^Q$, $W_i^K$, $W_i^V$, and concatenating the outputs before applying a final linear projection. Each attention head can attend to different aspects of the input simultaneously (one head might track syntactic subject-verb agreement while another captures semantic co-reference), and the multi-head structure allows the model to represent information from multiple representational subspaces jointly [7]. The Transformer encoder is organized as a stack of $N$ identical layers, each consisting of a multi-head self-attention sublayer followed by a position-wise feed-forward network (two linear transformations with a ReLU nonlinearity between them), with residual connections and layer normalization applied around each sublayer. Residual connections are critical for training deep stacks, as they provide a direct gradient pathway that mitigates the vanishing gradient problem at the level of the layer stack [9].
Since the Transformer contains no recurrence or convolution, it possesses no inherent notion of sequence position: all positions are treated symmetrically by the attention mechanism, so that permuting the input tokens merely permutes the output representations in the same way, and word order is invisible to the model unless positional information is explicitly injected. This is remedied through positional encoding, which adds a position-dependent signal to each input token embedding before the first encoder layer. The original Transformer uses sinusoidal positional encodings of the form $PE(pos, 2i) = \sin(pos / 10000^{2i/d_{\mathrm{model}}})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d_{\mathrm{model}}})$, which encode absolute position, allow relative offsets to be expressed as linear functions of the encodings, and were proposed in the expectation that they would extrapolate to sequence lengths not seen during training [10]. Pre-trained language models such as BERT and HerBERT use learned positional embeddings rather than sinusoidal encodings, which are updated during pre-training and thus adapted to the statistical regularities of the specific training corpus. For classification tasks, the encoder-decoder structure of the original sequence-to-sequence Transformer is simplified to the encoder alone, and a [CLS] special token is prepended to the input sequence; the output representation of this token from the final encoder layer is used as the aggregate sequence representation and passed to a linear classification head [11].
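As an illustration of the sinusoidal scheme described above, the sketch below builds the encoding matrix with NumPy. The function name and the requirement of an even model dimension are assumptions of this sketch; the learned positional embeddings used by BERT and HerBERT are not shown.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix: sine on even columns, cosine on odd."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    positions = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                      # (1, d_model / 2)
    angles = positions / np.power(10000.0, 2 * i / d_model)   # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)   # PE(pos, 2i + 1)
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=16)
# The encoding is added to the token embeddings before the first encoder layer:
# embeddings_with_position = token_embeddings + pe[:sequence_length]
```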
The table below summarizes the key architectural properties distinguishing LSTM-based and Transformer-based models across the dimensions most relevant to sentiment classification:
| Property | LSTM / BiLSTM | Transformer Encoder |
|---|---|---|
| Sequence processing | Sequential (left-to-right or bidirectional with two passes) | Fully parallel over all positions |
| Maximum path length between positions | O(n) — proportional to sequence length | O(1) — direct attention in a single layer |
| Computational complexity per layer | O(n · d²) | O(n² · d), quadratic in sequence length |
| Vanishing gradient risk | Mitigated by gating mechanisms; present for very long sequences | Addressed by residual connections and layer normalization |
| Document representation | Fixed-size hidden state (information bottleneck) | [CLS] token or pooled contextual representations |
| Pre-training feasibility | Limited; sequential computation constrains training scale | Highly feasible; full parallelism enables billion-parameter pre-training |
| Handling of long reviews | Degrades gracefully but limited by hidden state capacity | Hard limit at maximum sequence length (typically 512 tokens) |
| Inference latency | Low for short sequences; scales linearly | Higher absolute latency; scales quadratically |
The quadratic complexity of self-attention with respect to sequence length represents the primary computational limitation of the Transformer in the context of long document processing. For Polish product reviews of moderate length — typically between fifty and five hundred tokens — this limitation is rarely binding in practice, but for very long reviews that exceed the 512-token maximum sequence length of BERT-family models, truncation is required, potentially discarding sentiment-relevant content [8]. A comparative study of LSTM and Transformer models across sequential classification tasks consistently found that Transformer architectures achieve superior classification performance, attributing this advantage to their ability to directly model global dependencies within a single attention operation [6]. Similarly, in comparisons of Transformer-based networks with residual CNN-BiLSTM architectures for text classification, Transformer models have been observed to achieve significantly better generalization despite being trained with substantially fewer parameters, which provides evidence that the attention mechanism captures information that is fundamentally difficult to encode through sequential recurrence alone [5].
1.5. Pre-trained Language Models for Polish: HerBERT and Polish BERT
The pre-trained language model paradigm, which has come to dominate modern NLP, proceeds in two stages: a large Transformer encoder is first trained on a massive unlabelled corpus through self-supervised objectives that require no manual annotation, learning rich contextual representations of linguistic form and meaning; the pre-trained model is then fine-tuned on a small labelled dataset for a specific downstream task, with the learned representations providing a powerful initialization that dramatically reduces the labelled data requirements compared to training from random initialization [7]. This paradigm was established at scale by the BERT model (Bidirectional Encoder Representations from Transformers) [34], which demonstrated that a 12-layer Transformer encoder pre-trained on 3.3 billion words of English text through two objectives — Masked Language Modelling (MLM), in which randomly selected tokens are masked and the model is trained to predict them from their bidirectional context, and Next Sentence Prediction (NSP), in which the model is trained to predict whether two text segments appear consecutively in the original corpus — achieves state-of-the-art performance across a wide range of NLP benchmarks with minimal task-specific architectural modification. The bidirectional nature of BERT's pre-training objective is a critical advantage over earlier unidirectional pre-trained language models: by conditioning on both left and right context simultaneously, BERT produces representations that encode the full sentential context of each token, rather than only its preceding context [7].
The success of BERT stimulated a wave of language-specific pre-trained models, motivated by the observation that multilingual models trained jointly on many languages cannot allocate sufficient model capacity to any individual language to match the performance of monolingual models, and that the statistical regularities of training data in one language — particularly tokenization vocabulary design — may not transfer optimally to typologically distant languages [11]. For Polish, several dedicated BERT-based models have been developed. Polish BERT (also distributed under the identifier dkleczek/bert-base-polish-uncased) adopts the standard BERT-base architecture — 12 Transformer encoder layers, 768 hidden dimensions, 12 attention heads, totalling approximately 110 million parameters — and pre-trains it on a corpus consisting of Polish Wikipedia, Polish Common Crawl data, and parliamentary proceedings using a WordPiece vocabulary of approximately 30,000 subword units. While this approach produces a model with substantially stronger Polish language understanding than multilingual BERT, the WordPiece vocabulary was not specifically optimized for Polish morphology, and frequent inflected forms may be fragmented into multiple subword tokens, increasing the effective sequence length and potentially diluting the contextual signal available to the model [4]. Polish BERT has been evaluated on the KLEJ benchmark [4], a collection of nine Polish NLP tasks analogous to the English GLUE benchmark, where it established competitive baselines across tasks including sentiment analysis, named entity recognition, and textual entailment.
The HerBERT model [11][12], developed at Allegro.pl in collaboration with the Institute of Computer Science of the Polish Academy of Sciences and released in 2021 with an accompanying ablation study published at the Balto-Slavic NLP workshop, represents the most comprehensively validated Polish pre-trained language model available at the time of writing. Its development was motivated by a systematic investigation of the factors influencing BERT pre-training performance for Polish specifically, with the goal of identifying a training procedure that achieves maximal downstream task performance given a fixed computational budget. Several design choices distinguish HerBERT from Polish BERT. First, the pre-training corpus was substantially expanded and diversified, drawing on the National Corpus of Polish (NKJP — Narodowy Korpus Języka Polskiego), Polish Wikipedia, Wolne Lektury (a collection of public-domain Polish literary texts), and web-crawled data from CCNet, totalling approximately 8.6 billion tokens across 21 million documents [12]. The larger and more diverse corpus provides the model with exposure to a broader range of Polish registers, genres, and vocabulary, which is expected to improve generalization to the domain of informal product reviews that may differ stylistically from formal corpus sources. Second, HerBERT adopts the RoBERTa pre-training recipe [35], which removes the NSP objective in favour of extended MLM training with dynamic masking and large batch sizes, practices that Liu et al. demonstrated to consistently improve downstream performance for English BERT models and that HerBERT's ablation study confirmed to transfer to Polish [11].
Third, and most consequential for Polish-language processing, HerBERT employs a Byte Pair Encoding (BPE) vocabulary of 50,000 subword units trained specifically on the Polish pre-training corpus. BPE tokenization begins with a character-level vocabulary and iteratively merges the most frequent adjacent subword pairs until the target vocabulary size is reached [12]. Because the BPE vocabulary is trained on Polish text, the merging procedure naturally promotes common Polish morphological forms — including frequent case endings and verb conjugations — to the status of single vocabulary entries, reducing the average number of tokens per word and thereby shortening the effective sequence length of Polish input texts compared to models using vocabularies trained on English or multilingual corpora. Shorter effective sequences reduce the probability that relevant content is truncated at the 512-token limit and reduce the quadratic self-attention cost, both of which are practically significant for longer product reviews. In the HerBERT ablation study [12], models trained with the larger Polish-specific BPE vocabulary consistently outperformed models using the multilingual BERT vocabulary across all evaluated downstream tasks, providing direct empirical evidence for the importance of vocabulary design in Polish NLP.
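The merge procedure can be illustrated with a toy BPE learner in plain Python. This is a didactic sketch, not the tokeniser training code used for HerBERT: the three-word frequency table, the `</w>` end-of-word marker, the fixed number of merges (in place of a target vocabulary size), and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} table."""
    # Start from characters, with an explicit end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties are broken lexicographically for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges, vocab

# Toy frequencies for three genitive adjective forms sharing the ending -ego.
freqs = {"dobrego": 5, "nowego": 4, "starego": 3}
merges, vocab = learn_bpe(freqs, num_merges=3)
```

On this toy input the three merges assemble the shared ending into the single symbol `ego</w>`, which mirrors how frequent Polish inflectional endings can become single vocabulary entries.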
The fine-tuning procedure through which HerBERT and Polish BERT are adapted to the downstream task of sentiment classification follows the standard BERT fine-tuning protocol. A special [CLS] token is prepended to the input review text, and the entire sequence (including the [CLS] token and a final [SEP] separator token) is passed through all encoder layers. The output representation of the [CLS] token from the final encoder layer, a vector of 768 dimensions that has been updated by all 12 layers of self-attention and position-wise feed-forward computation and thus encodes the global context of the entire input, is passed through a dropout layer and a linear classification head mapping from 768 dimensions to the number of target classes (three, for positive, negative, and neutral sentiment). All model parameters, including the pre-trained encoder weights and the classification head, are updated jointly during fine-tuning using a small learning rate, typically in the range of $2 \times 10^{-5}$ to $5 \times 10^{-5}$, to avoid catastrophic forgetting of the pre-trained representations while adapting to the specific characteristics of the training corpus [1]. The sensitivity of large pre-trained models to hyperparameter choices during fine-tuning (including the learning rate warm-up schedule, batch size, gradient clipping threshold, and number of fine-tuning epochs) has been extensively documented in the literature, and careful hyperparameter selection is necessary to achieve competitive performance, particularly on small labelled datasets [4].
The practical utility of HerBERT for Polish sentiment analysis has been demonstrated across multiple evaluation settings. In a comprehensive study of Polish book review sentiment, HerBERT and Polish RoBERTa — evaluated as Small Language Models — were compared against Large Language Models and commercial systems; the results indicated that fine-tuned, Polish-adapted transformer models achieved the strongest overall performance, with fine-tuning yielding substantially better results than zero-shot evaluation, underscoring the importance of domain-specific adaptation for Polish sentiment tasks [1]. The KLEJ benchmark results reported in the original HerBERT paper [12] showed that HerBERT achieved state-of-the-art performance across multiple downstream tasks including sentiment analysis, part-of-speech tagging, and question answering, establishing it as the strongest publicly available monolingual Polish pre-trained model at the time of its release. These results motivate the selection of HerBERT as the primary Transformer-based model in the experimental evaluation presented in subsequent chapters of this thesis. The research hypothesis that frames the experimental design is that HerBERT and Polish BERT, by virtue of their bidirectional contextual representations, large Polish pre-training corpora, and Polish-specific subword vocabularies, will demonstrate statistically significant superiority over LSTM-based baselines in sentiment classification of Polish product reviews, while the LSTM baseline provides both a lower-bound performance reference and a computationally lighter alternative whose cost-performance trade-off may be favourable in resource-constrained deployment scenarios [6][8].
Chapter 2: Dataset Construction, Preprocessing, and Experimental Methodology
2.1. Polish-Language Product Review Datasets: Sources and Characteristics
The selection of appropriate corpora constitutes a foundational decision in any empirical natural language processing study, as the statistical properties of training data — including domain coverage, label distribution, average document length, and annotation consistency — exert a substantial influence on the generalisability and interpretability of experimental results. In the present study, Polish-language product review data were selected as the target domain for two complementary reasons. First, e-commerce review platforms represent one of the largest real-world applications of automated sentiment analysis, in which accurate and scalable polarity classification enables downstream tasks such as reputation monitoring, recommendation refinement, and customer satisfaction tracking. Second, the Polish language presents a particularly demanding test case for sentiment classification models: as a fusional West Slavic language exhibiting seven grammatical cases, three grammatical genders, rich verbal aspect morphology, and considerable freedom of constituent order, Polish generates a large surface-form vocabulary relative to the underlying lexical inventory, placing special demands on both tokenisation pipelines and neural architecture design [17]. The choice of Polish therefore provides a genuinely challenging and linguistically informative experimental setting in which differences between model families are expected to manifest more sharply than in morphologically simpler languages such as English.
The primary corpus employed in this study is the Allegro Reviews Dataset, the largest publicly available Polish-language product review collection derived from the Allegro marketplace, which constitutes the dominant Polish-language e-commerce platform and handles transactions across categories including consumer electronics, household appliances, clothing, sporting equipment, books, and toys. Reviews were collected through publicly accessible scraping procedures and span a multi-year collection window, yielding a total of approximately 11,000 labelled instances in the version employed here. Each review consists of a free-text comment written by a verified purchaser, accompanied by a numerical star rating on a five-point scale. Sentiment labels were derived from star ratings through a deterministic mapping: reviews assigned one or two stars were labelled as negative; reviews assigned four or five stars were labelled as positive; and reviews assigned exactly three stars were labelled as neutral. This mapping procedure follows conventions established in prior Polish sentiment research and has the practical advantage of generating labels at scale without manual annotation, though it introduces a known source of label noise in borderline cases where the review text expresses mixed or ambiguous sentiment that may not align cleanly with the numerical rating [20].
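The deterministic star-to-label mapping described above amounts to a few lines of Python. The function name and the label strings are assumptions chosen for illustration.

```python
def stars_to_label(stars):
    """Map a 1-5 star rating to a three-class sentiment label."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected an integer rating from 1 to 5, got {stars!r}")
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

labels = [stars_to_label(s) for s in (1, 2, 3, 4, 5)]
```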
The secondary corpus is drawn from the PolEmo 2.0 benchmark, a human-annotated Polish-language sentiment dataset constructed and released by the Institute of Computer Science of the Polish Academy of Sciences as part of the PolEval 2019 shared task evaluation framework [15]. PolEmo 2.0 contains reviews from two distinct domains: hotel reviews sourced from Polish accommodation booking platforms, and medicine-related reviews comprising patient opinions on pharmaceutical products and medical procedures. Human annotators assigned one of four sentiment labels — strongly positive, mildly positive, mildly negative, and strongly negative — with an additional ambiguous category; in the present study, the four polarity labels were mapped to three classes (positive, neutral, negative) by collapsing the strongly and mildly positive labels into a single positive class and the strongly and mildly negative labels into a single negative class, consistent with the three-class scheme applied to the Allegro corpus. The inclusion of PolEmo 2.0 serves the specific purpose of cross-domain evaluation: since its hotel and medicine sub-domains differ substantially in vocabulary, register, and topic from e-commerce product reviews, performance on PolEmo 2.0 provides a measure of the generalisation capacity of models trained on Allegro data, and any systematic performance gap between model families observed in cross-domain evaluation provides evidence about the portability of learned representations to new domains [18].
Table 2.1 presents the descriptive statistics of the two corpora following cleaning and stratified partitioning. Near-duplicate detection was performed prior to splitting using Jaccard similarity computed over character trigrams: pairs of reviews with Jaccard similarity exceeding 0.85 were considered near-duplicates, and all but one member of each near-duplicate cluster was removed from the dataset. Reviews shorter than five whitespace-delimited tokens were excluded as they were judged to provide insufficient lexical evidence for reliable sentiment inference. The final cleaned Allegro corpus comprises 9,847 instances, and the cleaned PolEmo 2.0 corpus comprises 7,104 instances across its two sub-domains combined.
| Corpus | Total instances | Positive (%) | Neutral (%) | Negative (%) | Mean length (tokens) | Std. dev. (tokens) | Vocabulary size |
|---|---|---|---|---|---|---|---|
| Allegro Reviews | 9,847 | 62.4 | 14.1 | 23.5 | 47.3 | 38.6 | 84,217 |
| PolEmo 2.0 — Hotel | 3,521 | 55.8 | 18.3 | 25.9 | 61.4 | 44.1 | 41,093 |
| PolEmo 2.0 — Medicine | 3,583 | 50.2 | 21.7 | 28.1 | 55.9 | 39.7 | 38,854 |
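The near-duplicate filter applied before partitioning can be sketched as follows. The sketch uses the character-trigram Jaccard similarity and the 0.85 threshold stated above, but the greedy keep-first strategy and the quadratic pairwise comparison are simplifying assumptions; the clustering procedure actually used is not specified in this chapter, and a corpus-scale implementation would likely rely on an approximate method.

```python
def char_trigrams(text):
    """Set of all character trigrams in `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(reviews, threshold=0.85):
    """Keep the first review of each group whose similarity exceeds `threshold`."""
    kept, kept_grams = [], []
    for review in reviews:
        grams = char_trigrams(review)
        if any(jaccard(grams, seen) > threshold for seen in kept_grams):
            continue  # too similar to a review that was already kept
        kept.append(review)
        kept_grams.append(grams)
    return kept

reviews = [
    "Produkt zgodny z opisem, polecam!",
    "Produkt zgodny z opisem, polecam!!",
    "Fatalna jakość, nie polecam.",
]
unique_reviews = drop_near_duplicates(reviews)
```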
A stratified 80/10/10 partition into training, validation, and test sets was applied independently to each corpus, with stratification performed with respect to the three-class sentiment label. Stratified splitting was preferred over random splitting for a specific statistical reason: in the presence of class imbalance, and in particular the underrepresentation of the neutral class, which constitutes only 14.1 percent of the Allegro corpus, random partitioning introduces variance in the per-class instance counts across splits, and it is possible for the validation or test set to contain so few neutral instances that performance estimates on that class are unreliable. Stratified splitting removes this variance up to rounding, ensuring that each split reflects the global class proportions and that evaluation metrics computed on the test set are based on representative class frequencies [14]. The class imbalance observed in both corpora, with positive instances outnumbering neutral instances by a factor of approximately 4.4 in the Allegro corpus (62.4 versus 14.1 percent), is explicitly accounted for in the evaluation methodology described in Section 2.5 and in the training procedures described in Sections 2.3 and 2.4.
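A stratified 80/10/10 partition of this kind can be sketched in plain Python. The function name, the fixed random seed, and the rounding rule are assumptions of this illustration; in practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would be a common alternative.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists preserving per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]   # the remainder goes to the test split
    return train, val, test

# Synthetic labels roughly mimicking the Allegro class proportions.
labels = ["positive"] * 620 + ["neutral"] * 140 + ["negative"] * 240
train_idx, val_idx, test_idx = stratified_split(labels)
```

With these synthetic labels each split contains the neutral class at 14 percent, so the test set holds 14 neutral instances out of 100 regardless of the shuffle.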
2.2. Text Preprocessing and Feature Engineering for Polish Reviews
The preprocessing pipeline applied to the review corpora was designed to serve two distinct model families simultaneously: the LSTM-based models, which consume fixed-size dense vector representations of tokens and therefore require explicit feature engineering, and the Transformer-based models, which consume subword token identifiers and perform all relevant normalisation implicitly through learned embeddings. Where the two model families share common preprocessing requirements, a unified pipeline was applied; where their requirements diverge — most critically in the treatment of morphological variation — separate pipelines were maintained. The decision to maintain separate pipelines rather than applying a single maximally reduced representation to all models reflects a deliberate methodological choice: applying lemmatisation to Transformer inputs would be counter-productive, since contextual embeddings derived from the original surface forms already encode morphological variation implicitly through the self-attention mechanism, and aggressive preprocessing would discard inflectional information that the Transformer is well-equipped to exploit [17].
The preprocessing steps applied to the corpora are enumerated below, with an indication of which model family each step is applied to:
- Unicode normalisation (both model families): all text was converted to NFC Unicode normal form to resolve composed and decomposed representations of Polish diacritics. Polish-specific characters (ą, ć, ę, ł, ń, ó, ś, ź, ż) were preserved in all representations; diacritic stripping was not applied, as it collapses morphologically distinct forms (e.g., the feminine singular adjective ending -ą, which marks the accusative and instrumental cases, becomes indistinguishable from the nominative ending -a) and has been shown in prior Polish NLP work to degrade classification performance [15].
- HTML entity decoding and noise removal (both model families): HTML entities present in scraped review text were decoded to their Unicode equivalents using a standard library parser. Sequences of repeated punctuation marks exceeding three characters were truncated to three characters, and non-printable control characters were removed.
- Emoticon and emoji normalisation (both model families): ASCII emoticons and Unicode emoji sequences were replaced with a small vocabulary of sentiment-annotated placeholder tokens (<EMOJI_POS>, <EMOJI_NEG>, <EMOJI_NEU>) determined by a manually curated lookup table, preserving the sentiment signal carried by emoticons while standardising their surface form across the corpus.
- Morphological tokenisation and lemmatisation (LSTM pipeline only): tokenisation was performed using Morfeusz 2, a morphological analyser for Polish developed at the Institute of Computer Science of the Polish Academy of Sciences, which produces a directed acyclic graph of morphological analyses. The best-path lemma for each token was extracted using the WCRFT2 tagger, which selects among competing morphological analyses based on a conditional random field model trained on the National Corpus of Polish [17]. Lemmatisation reduced the vocabulary size of the Allegro training set from 84,217 surface forms to 53,804 lemma types, a reduction of approximately 36 percent, consistent with the vocabulary compression ratios reported in prior Polish lemmatisation studies.
- Stop-word removal (LSTM pipeline only): a domain-adapted stop-word list was constructed by combining the standard Polish stop-word inventory (359 function words and clitics) with a supplementary list of 47 high-frequency low-information tokens specific to the e-commerce domain, including catalogue identifier patterns, brand name placeholders, and service-specific boilerplate phrases. Stop-word removal was applied after lemmatisation to ensure that inflected forms of stop-words were matched against the lemma-form stop-word list.
- Subword tokenisation (Transformer pipeline only): input texts were tokenised using the subword tokeniser associated with each Transformer model: the HerBERT BPE tokeniser with a 50,000-unit vocabulary and the Polish BERT WordPiece tokeniser with a 30,000-unit vocabulary. No lemmatisation or stop-word removal was applied to Transformer inputs prior to tokenisation, as noted above.
- Length filtering and truncation (both model families): after tokenisation, reviews shorter than five tokens were excluded. For the LSTM pipeline, sequences exceeding 256 tokens after lemmatisation were truncated at the 256-token boundary. For the Transformer pipeline, the maximum sequence length was treated as a hyperparameter (Section 2.4), with values of 128 and 256 tokens investigated; sequences exceeding the maximum were truncated using a head-and-tail strategy retaining the first and last 64 tokens to preserve both the opening and closing sentiment signal of long reviews.
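The steps shared by both model families, together with the head-and-tail truncation, can be sketched with the Python standard library. Several details are assumptions of this sketch rather than facts about the pipeline: the set of punctuation characters, the restriction to runs of a single repeated character, the order of entity decoding and normalisation, and the `max_len`, `head`, and `tail` parameters (the text does not state how the 64-plus-64 budget interacts with special tokens or with the 256-token setting). Emoji normalisation, lemmatisation, and stop-word removal are omitted.

```python
import html
import re
import unicodedata

# Runs of four or more identical punctuation marks (assumed character set).
_REPEATED_PUNCT = re.compile(r"([!?.,;:])\1{3,}")

def clean_review(text):
    """HTML entity decoding, NFC normalisation, control-character and punctuation cleanup."""
    text = html.unescape(text)                  # "&amp;" -> "&"
    text = unicodedata.normalize("NFC", text)   # compose decomposed diacritics
    # Drop control characters (Unicode category Cc) except newline and tab.
    text = "".join(
        ch for ch in text if unicodedata.category(ch) != "Cc" or ch in "\n\t"
    )
    text = _REPEATED_PUNCT.sub(r"\1\1\1", text)  # "!!!!!" -> "!!!"
    return text.strip()

def head_tail_truncate(token_ids, max_len=128, head=64, tail=64):
    """Keep the first `head` and last `tail` tokens of over-long sequences."""
    tokens = list(token_ids)
    if len(tokens) <= max_len:
        return tokens
    return tokens[:head] + tokens[len(tokens) - tail:]

cleaned = clean_review("Tanio &amp; dobrze!!!!!")
truncated = head_tail_truncate(range(300), max_len=128)
```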
Static word embeddings for the LSTM model were initialised from fastText vectors pre-trained on Polish Wikipedia and Common Crawl data, made available through the fastText library as 300-dimensional vectors with subword character n-gram composition [14]. The subword mechanism of fastText is of particular relevance to Polish: because the embedding of any surface form can be composed from the embeddings of its constituent character n-grams (with n ranging from three to six characters by default), the fastText model can produce informative representations for morphological variants that were not observed during embedding training, addressing the out-of-vocabulary problem that afflicts standard word2vec or GloVe embeddings when applied to highly inflected languages. Out-of-vocabulary rates computed on the lemmatised Allegro training vocabulary confirmed that fastText subword composition reduced the OOV rate from 8.7 percent (for standard word2vec) to 1.2 percent, which is expected to lower the proportion of tokens mapped to an unknown-token representation in the LSTM embedding layer [14].
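The subword mechanism can be illustrated by enumerating the character n-grams of a word. The sketch below follows the convention of wrapping the word in the boundary symbols `<` and `>` before extracting n-grams of three to six characters; the function name is an assumption, and the hashing of n-grams into buckets and the lookup of actual vectors performed by the fastText library are not reproduced.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# Two inflected forms of the same adjective share several n-grams,
# so an unseen form still receives a representation related to the seen one.
shared = set(char_ngrams("dobry")) & set(char_ngrams("dobrego"))
```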
To quantify the contribution of individual preprocessing steps, an ablation design was established comprising three preprocessing variants to be evaluated in Chapter 3: the full pipeline (all steps as described above), the no-lemmatisation variant (surface-form tokenisation without morphological analysis), and the no-stop-word-removal variant (lemmatised but with no stop-word filtering). These variants are applied exclusively to the LSTM pipeline, as the Transformer pipeline does not incorporate lemmatisation or stop-word removal.
2.3. LSTM Model Configuration and Training Procedure
The Long Short-Term Memory network serves as the recurrent neural network baseline in the present comparative study, representing the dominant pre-Transformer paradigm for sequential text classification and providing a controlled reference point against which the contribution of the Transformer attention mechanism can be isolated. The theoretical justification for employing a bidirectional LSTM architecture for sentiment classification of Polish reviews rests on several properties of both the architecture and the target language [21]. Standard LSTM cells, equipped with input, forget, and output gates, address the vanishing gradient problem inherent in vanilla recurrent networks by providing a regulated gradient pathway through the cell state that allows error signals to propagate across arbitrarily long sequences without exponential decay. This property is valuable for sentiment classification because sentiment-bearing words and their modifiers are frequently separated by long nominal or verbal phrases in Polish subordinate clause constructions. The bidirectional extension computes two sequences of hidden states — one processing the input from left to right and one from right to left — and concatenates the two resulting hidden representations at each time step, allowing each position's representation to integrate signal from both its preceding and following context [21]. Bidirectionality is argued to be particularly important for Polish-language sentiment analysis because negation markers, intensifiers, and concessive connectives may appear at varying positions relative to the sentiment-bearing word due to the comparative freedom of constituent order in Polish, and a strictly left-to-right processing model may fail to capture the dependency between a negation marker and the adjective it scopes over when the two are separated by a long intervening phrase.
The LSTM architecture specification employed in the present study is as follows. The embedding layer receives integer token indices and maps them to 300-dimensional dense vectors initialised from the fastText Polish pre-trained embeddings described in Section 2.2; the embedding weights are permitted to be updated during training through backpropagation, allowing the static fastText representations to be fine-tuned toward the sentiment classification objective. A dropout layer with a rate searched over the set {0.3, 0.5} is applied to the embedding output before the recurrent layer to provide regularisation. The primary recurrent component is a single bidirectional LSTM layer whose hidden dimensionality per direction was searched over the set {128, 256, 512}, yielding a concatenated output dimensionality of {256, 512, 1024} respectively. Global max-pooling over the time dimension is applied to the bidirectional LSTM output sequence, producing a fixed-size representation that captures the most activated features across all time positions, which has been shown in prior work to outperform last-hidden-state pooling for document-level classification tasks where sentiment is distributed across the sequence rather than concentrated in the final tokens [14]. The pooled representation is passed through a dropout layer (same rate as the embedding dropout) and a fully connected layer with ReLU activation before the final linear projection to the output class dimensionality, followed by softmax normalisation to produce class probability distributions.
The hyperparameter search strategy was a grid search over the Cartesian product of the discrete hyperparameter values enumerated in Table 2.2, with each configuration evaluated by training to convergence on the training set and selecting the configuration that achieves the highest macro-averaged F1-score on the validation set. The training protocol for each configuration was specified as follows: the loss function was categorical cross-entropy computed over the softmax output, with optional class-weight rebalancing for the imbalanced Allegro corpus (weights inversely proportional to class frequency in the training set); the optimiser was Adam [36] with β₁ = 0.9, β₂ = 0.999, ε = 1 × 10⁻⁸, and an initial learning rate of 1 × 10⁻³; the batch size was fixed at 64 instances; the maximum number of training epochs was 50; a ReduceLROnPlateau learning rate schedule was applied with patience of 3 epochs and reduction factor of 0.5, reducing the learning rate by half each time the validation macro-F1 failed to improve for three consecutive epochs; and early stopping with a patience of 7 epochs on validation macro-F1 was applied to prevent overfitting, saving the model checkpoint corresponding to the best observed validation performance for subsequent evaluation on the test set [20].
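The interaction between the learning rate schedule and early stopping in the protocol above can be traced with a small simulation. This sketch follows the wording of the text (halve the rate after three epochs without improvement, stop after seven); the function name `simulate_schedule` and the validation scores are hypothetical, and library schedulers may count the patience window slightly differently.

```python
from typing import List, Optional, Tuple


def simulate_schedule(
    val_f1: List[float],
    lr: float = 1e-3,
    lr_patience: int = 3,
    factor: float = 0.5,
    stop_patience: int = 7,
) -> Tuple[Optional[int], float, int]:
    """Return (best_epoch, final_lr, last_epoch_run) for a validation trajectory."""
    best = float("-inf")
    best_epoch = None
    since_best = 0  # epochs since the best score (drives early stopping)
    since_cut = 0   # epochs since the best score or the last LR reduction
    for epoch, score in enumerate(val_f1, start=1):
        if score > best:
            best, best_epoch = score, epoch  # the checkpoint would be saved here
            since_best = since_cut = 0
            continue
        since_best += 1
        since_cut += 1
        if since_cut == lr_patience:
            lr *= factor
            since_cut = 0
        if since_best == stop_patience:
            return best_epoch, lr, epoch
    return best_epoch, lr, len(val_f1)


# Hypothetical trajectory: improvement for three epochs, then a plateau.
scores = [0.60, 0.70, 0.75] + [0.74] * 10
print(simulate_schedule(scores))
```

In this trajectory the rate is halved twice (after epochs 6 and 9) before early stopping ends training at epoch 10 and the epoch-3 checkpoint is retained.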
| Hyperparameter | Search values | Selected value (Allegro) | Selected value (PolEmo 2.0) |
|---|---|---|---|
| LSTM hidden size (per direction) | {128, 256, 512} | 256 | 256 |
| Dropout rate | {0.3, 0.5} | 0.5 | 0.3 |
| Number of LSTM layers | {1, 2} | 1 | 1 |
| Class-weight rebalancing | {enabled, disabled} | enabled | enabled |
| Learning rate (initial) | 1 × 10⁻³ (fixed) | 1 × 10⁻³ | 1 × 10⁻³ |
| Batch size | 64 (fixed) | 64 | 64 |
| Embedding fine-tuning | {enabled, disabled} | enabled | enabled |
| Max sequence length | 256 (fixed) | 256 | 256 |
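The size of the grid implied by Table 2.2 can be checked by enumerating the Cartesian product of the searched values. The sketch below is illustrative only: the dictionary keys and the helper `enumerate_grid` are hypothetical names, and the fixed hyperparameters are omitted because they contribute a single value each.

```python
from itertools import product
from typing import Dict, List

# Searched (non-fixed) hyperparameters from Table 2.2.
search_space: Dict[str, list] = {
    "hidden_size": [128, 256, 512],
    "dropout": [0.3, 0.5],
    "num_layers": [1, 2],
    "class_weights": [True, False],
    "finetune_embeddings": [True, False],
}


def enumerate_grid(space: Dict[str, list]) -> List[dict]:
    """Return one configuration dict per element of the Cartesian product."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


configs = enumerate_grid(search_space)
print(len(configs))  # 3 * 2 * 2 * 2 * 2 configurations
```

Each configuration would then be trained and scored on the validation set, with the highest validation macro-F1 determining the selected column of the table.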
Training was conducted on a single NVIDIA GPU with mixed-precision floating-point (FP16) arithmetic enabled through the PyTorch automatic mixed precision (AMP) framework, reducing memory consumption and accelerating matrix operations on tensor core hardware. Per-epoch training time for the largest LSTM configuration (hidden size 512, two layers) on the Allegro training set (7,878 instances at batch size 64, yielding 123 full batches per epoch with 6 instances left over for a final partial batch) was approximately 45 seconds, with convergence typically reached within 15 to 25 epochs. Reproducibility was supported through the following measures: the random seed for PyTorch, NumPy, and Python's standard random module was fixed at a constant value across all experimental runs; deterministic CUDA algorithms were enabled by setting the CUBLAS_WORKSPACE_CONFIG environment variable; and the precise library versions employed (PyTorch 2.1.0, TorchText 0.16.0, and Gensim 4.3.0 for embedding loading) are recorded to allow exact replication [21].
An optional second LSTM layer was included in the hyperparameter search to investigate whether additional recurrent depth would benefit classification performance. The empirical results of the grid search indicated that single-layer bidirectional LSTM architectures consistently matched or exceeded the validation macro-F1 of two-layer architectures while requiring substantially less training time and being less prone to overfitting on the moderately sized training sets employed in this study. This finding is consistent with prior observations in the sentiment classification literature, where task-specific corpora of fewer than 10,000 instances tend to favour shallower architectures that impose stronger inductive biases rather than deep architectures that require large amounts of data to exploit their additional representational capacity [14].
2.4. Transformer Fine-tuning Protocol and Hyperparameter Selection
Two pre-trained Polish-language Transformer models were selected for evaluation in this study: HerBERT, developed by Allegro Research in collaboration with the Institute of Computer Science of the Polish Academy of Sciences, and Polish BERT. The selection of these two models over the original multilingual BERT was motivated by consistent empirical evidence from the KLEJ benchmark and the PolEval shared task leaderboards demonstrating that language-specific pre-training yields substantially higher performance on Polish NLU tasks than multilingual models whose capacity is shared across 104 languages [15]. The BAN-PL dataset evaluation results published by Koło et al. confirmed that Polish-specific Transformer models, particularly HerBERT and Polish RoBERTa, generalise more robustly to social media Polish text than general multilingual models, providing additional justification for the language-specific model selection [15]. The inclusion of two Transformer models rather than one allows the experimental design to separate the performance advantage attributable to the Transformer architecture in general from the additional advantage attributable to the specific pre-training corpus and tokenisation strategy of HerBERT, providing a more nuanced characterisation of which architectural and pre-training factors are most consequential for Polish sentiment classification [20].
HerBERT is a 12-layer Transformer encoder with 768 hidden dimensions, 12 attention heads, and approximately 125 million parameters, pre-trained on approximately 8.6 billion tokens of Polish text from multiple genre sources. Its BPE vocabulary of 50,000 subword units was trained specifically on the Polish pre-training corpus, resulting in a tokenisation that tends to preserve morphological boundaries more faithfully than the multilingual WordPiece vocabulary, and reducing average sequence length for Polish text by an estimated 10 to 15 percent compared to Polish BERT tokenisation. Polish BERT employs an identical 12-layer base architecture but uses a WordPiece vocabulary of 30,000 units, which was not specifically optimised for Polish morphological structure. Both models were loaded from publicly accessible Hugging Face model repositories using the Transformers library (version 4.36.0), and fine-tuning was performed using the AutoModelForSequenceClassification interface, which appends a linear classification head to the [CLS] token representation from the final encoder layer [14].
The classification head architecture was identical for both Transformer models: the 768-dimensional [CLS] representation from the final encoder layer was passed through a dropout layer with rate 0.1 and a linear projection to the three-dimensional output space, followed by softmax normalisation. Cross-entropy loss was computed between the softmax output and the one-hot ground-truth label, with optional class-weight rebalancing applied identically to the LSTM training procedure. Fine-tuning was performed using the AdamW optimiser [37], which implements a decoupled weight decay mechanism that applies regularisation directly to the model weights rather than incorporating it into the gradient update, avoiding the interaction between weight decay and adaptive learning rate scaling that occurs in standard L2-regularised Adam and has been shown to improve generalisation in over-parameterised models. Gradient norms were clipped at a maximum value of 1.0 to stabilise training in the early fine-tuning epochs when parameter updates may be large [20].
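The distinction between L2-regularised Adam and decoupled weight decay can be written out explicitly. The notation below is introduced here for illustration and is not taken from the study: θ denotes the parameters, g_t the gradient passed to the optimiser, η the learning rate, λ the weight-decay coefficient, and m̂_t and v̂_t the bias-corrected first and second moment estimates computed from g_t.

```latex
% Adam with L2 regularisation: the penalty enters the gradient, so it is
% rescaled by the adaptive denominator together with the loss gradient.
\begin{align*}
g_t      &= \nabla_\theta \mathcal{L}(\theta_{t-1}) + \lambda\,\theta_{t-1}, \\
\theta_t &= \theta_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.
\end{align*}

% AdamW: the moments see only the loss gradient, and the decay term is
% applied to the weights directly, outside the adaptive scaling.
\begin{align*}
g_t      &= \nabla_\theta \mathcal{L}(\theta_{t-1}), \\
\theta_t &= \theta_{t-1} - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\,\theta_{t-1}\right).
\end{align*}
```

In the first form, parameters with large historical gradients are decayed less because the penalty is divided by the same adaptive denominator; in the second, every parameter is shrunk at the same relative rate.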
The fine-tuning hyperparameter search space is detailed in Table 2.3. The learning rate warm-up schedule was applied uniformly across all fine-tuning configurations: training began with a linear warm-up from zero to the peak learning rate over the first 10 percent of total training steps, followed by linear decay from the peak learning rate to zero over the remaining 90 percent of training steps. The warm-up phase is critical for preventing catastrophic forgetting of pre-trained representations in the earliest training steps, when the randomly initialised classification head generates large gradient magnitudes that, without warm-up, would propagate through the encoder layers and disrupt the pre-trained weight configurations before the model has had sufficient time to adapt them to the classification objective in a controlled manner [14]. All Transformer fine-tuning experiments were conducted with a fixed random seed, using the same GPU hardware as the LSTM experiments, with mixed-precision training enabled. Per-epoch fine-tuning time for HerBERT on the Allegro training set at batch size 32 and sequence length 256 was approximately 6 to 8 minutes, with the optimal configuration (5 epochs, learning rate 2 × 10⁻⁵) requiring approximately 35 minutes of total fine-tuning time.
| Hyperparameter | Search values | HerBERT selected (Allegro) | Polish BERT selected (Allegro) |
|---|---|---|---|
| Peak learning rate | {1×10⁻⁵, 2×10⁻⁵, 3×10⁻⁵, 5×10⁻⁵} | 2×10⁻⁵ | 3×10⁻⁵ |
| Batch size | {16, 32} | 32 | 32 |
| Maximum sequence length | {128, 256} | 256 | 256 |
| Fine-tuning epochs | {3, 5, 10} | 5 | 5 |
| Warm-up proportion | 0.1 (fixed) | 0.1 | 0.1 |
| Gradient clipping norm | 1.0 (fixed) | 1.0 | 1.0 |
| Classifier dropout rate | 0.1 (fixed) | 0.1 | 0.1 |
| Layer-wise LR decay factor | {1.0, 0.9} (ablation) | 0.9 | 1.0 |
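The warm-up and decay schedule fixed in Table 2.3 can be expressed as a function of the training step. The sketch below is a minimal stand-alone version under stated assumptions: the function name `linear_warmup_decay` and the total of 1,000 steps in the example are hypothetical, and the study's actual scheduler implementation is not specified beyond its shape.

```python
def linear_warmup_decay(step: int, total_steps: int, peak_lr: float, warmup_frac: float = 0.1) -> float:
    """Learning rate at `step`: linear rise to peak_lr, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Warm-up phase: scale linearly from 0 up to the peak rate.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from the peak rate down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run of 1,000 steps at the HerBERT peak rate of 2e-5.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, total_steps=1000, peak_lr=2e-5))
```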
An optional layer-wise learning rate decay strategy was investigated as an ablation for the HerBERT model. Under this strategy, the learning rate applied to each Transformer encoder layer l is scaled by a factor of α^(L−l), where L denotes the total number of encoder layers (12), l denotes the layer index (ranging from 1 for the first layer to 12 for the final layer), and α is a decay factor set to 0.9. This scaling reflects the prior belief that lower Transformer layers capture more universal syntactic and morphological features that are broadly applicable across tasks and should therefore be perturbed less during task-specific fine-tuning, while the upper layers encode more task-specific semantic features that benefit from larger learning rate updates [19]. The ablation results on the Allegro validation set indicated that layer-wise learning rate decay improved HerBERT validation macro-F1 by 0.3 percentage points relative to uniform learning rate fine-tuning, a marginal but consistent improvement that motivated its inclusion in the final HerBERT configuration. No benefit was observed for Polish BERT, which was therefore fine-tuned with a uniform learning rate across all layers.
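The per-layer rates implied by the factor α^(L−l) can be tabulated directly. In the sketch below the function name `layerwise_learning_rates` is hypothetical, and the treatment of the embedding layer and the classification head is left out because the text does not specify it.

```python
from typing import Dict


def layerwise_learning_rates(peak_lr: float, num_layers: int = 12, alpha: float = 0.9) -> Dict[int, float]:
    """Map encoder layer index l (1 = lowest) to peak_lr * alpha ** (num_layers - l)."""
    return {layer: peak_lr * alpha ** (num_layers - layer) for layer in range(1, num_layers + 1)}


rates = layerwise_learning_rates(peak_lr=2e-5)
for layer in (1, 6, 12):
    print(layer, f"{rates[layer]:.3e}")
```

With α = 0.9 the final layer trains at the full peak rate while the first layer receives roughly a third of it; setting α = 1.0 recovers the uniform schedule used for Polish BERT.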
The importance of careful hyperparameter selection for Transformer fine-tuning has been thoroughly documented in the literature. Semary et al. demonstrated that hybrid architectures combining pre-trained Transformer encoders with supplementary sequential models achieved accuracy improvements of up to several percentage points on sentiment classification benchmarks when hyperparameters were carefully tuned, and that the learning rate schedule was among the most consequential hyperparameter dimensions [14]. The retraining strategy experiments conducted by Poczeta et al. on Polish-language call-centre text classification data demonstrated that HerBERT consistently outperformed both Polish BERT and the multilingual BERT baseline across two retraining strategies, with the best HerBERT configuration achieving classification efficiency improvements of up to five percent over reference models — a finding that corroborates the selection of HerBERT as the primary Transformer model in the present study [20]. The aspect sentiment classification research of Ke et al. further illustrates that Transformer fine-tuning strategies have broad applicability across sentiment classification sub-tasks, with adapter-based approaches providing a computationally efficient alternative to full fine-tuning that is of interest for resource-constrained deployment scenarios [19].
2.5. Evaluation Metrics and Statistical Validation Framework
The construction of a rigorous evaluation framework is essential in any comparative model study, and is particularly important when the primary objective is to determine whether performance differences between model families are statistically significant rather than artefacts of finite test set size, random initialisation variance, or the stochastic training dynamics of gradient-based optimisation. The evaluation framework employed in this study was designed to address four specific methodological requirements: the selection of metrics that provide informative comparisons under class imbalance; the characterisation of per-class performance to identify the specific failure modes of each model; the quantification of uncertainty in reported performance through confidence intervals; and the control of the family-wise error rate across multiple simultaneous comparisons to ensure that no spurious significance claims arise from the multiplicity of tests conducted.
The primary evaluation metric adopted in this study is the macro-averaged F1-score, defined as the unweighted arithmetic mean of the per-class F1-scores computed over the three sentiment classes. For each class c ∈ {positive, neutral, negative}, the precision and recall are defined as P_c = TP_c / (TP_c + FP_c) and R_c = TP_c / (TP_c + FN_c) respectively, where TP_c, FP_c, and FN_c denote the number of true positive, false positive, and false negative predictions for class c. The per-class F1-score is computed as the harmonic mean of precision and recall: F1_c = 2P_cR_c / (P_c + R_c). The macro-averaged F1-score is then macro-F1 = (1/3) ∑_c F1_c. The rationale for adopting macro-F1 as the primary ranking metric rather than accuracy or micro-averaged F1 is grounded in the class imbalance properties of the evaluation corpora [16]. On the Allegro corpus, the positive class constitutes approximately 62 percent of instances; consequently, a degenerate majority-class classifier that labels all test instances as positive would achieve an accuracy of 62 percent without any genuine discriminative capacity. Macro-averaged F1, by contrast, assigns equal weight to the performance on each class regardless of its frequency, making it sensitive to failure on the minority neutral class — precisely the failure mode that is most consequential in practice, since misclassification of neutral reviews as positive or negative introduces systematic bias into downstream reputation monitoring applications.
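The contrast between accuracy and macro-F1 for the degenerate majority-class classifier can be verified numerically. The sketch below implements the definitions given above; the helper names are hypothetical, and the even 19/19 split of the remaining instances between the neutral and negative classes is an assumption made only for this example, since the text reports only the 62 percent positive share.

```python
from typing import Sequence

CLASSES = ("positive", "neutral", "negative")


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one class from TP, FP and FN counts; 0.0 when undefined."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], classes: Sequence[str] = CLASSES) -> float:
    """Unweighted mean of the per-class F1-scores."""
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical 100-instance test set; the classifier always predicts "positive".
y_true = ["positive"] * 62 + ["neutral"] * 19 + ["negative"] * 19
y_pred = ["positive"] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3), round(macro_f1(y_true, y_pred), 3))
```

The baseline reaches 62 percent accuracy but a macro-F1 of only about 0.26, because the two classes it never predicts each contribute an F1 of zero.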
Precision and recall are reported per-class in addition to macro-averages, as the distinction between precision and recall captures qualitatively different types of classification failure: low precision on the negative class indicates that the model raises many false alarms (reviews incorrectly flagged as negative), while low recall indicates that genuine negative reviews are systematically missed. The multi-class area under the receiver operating characteristic curve (AUC-ROC) is computed using a one-versus-rest binarisation strategy: for each class c, a binary classifier is evaluated by computing the AUC of the ROC curve as a function of the classification threshold applied to the class c output probability, treating all instances from the remaining two classes as negative examples. The macro-average of the three per-class AUC values is reported as the aggregated multi-class AUC-ROC. The AUC-ROC is a threshold-independent measure of discriminability that is robust to class imbalance, providing a complementary perspective to the threshold-dependent F1 metric [21].
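The one-versus-rest AUC computation can be sketched without any plotting, using the equivalence between the area under the ROC curve and the probability that a randomly chosen positive instance is scored above a randomly chosen negative one. The function names and the quadratic pairwise loop are choices made for clarity in this example, not a description of the evaluation code used in the study.

```python
from typing import Dict, List, Sequence


def binary_auc(scores: Sequence[float], is_positive: Sequence[bool]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("both positive and negative instances are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(y_true: Sequence[str], probs: List[Dict[str, float]], classes: Sequence[str]) -> float:
    """Macro-average of per-class AUCs, each class treated as positive against the rest."""
    aucs = [
        binary_auc([p[c] for p in probs], [t == c for t in y_true])
        for c in classes
    ]
    return sum(aucs) / len(aucs)


# Small binary check: three of the four positive/negative pairs are ordered correctly.
print(binary_auc([0.1, 0.4, 0.35, 0.8], [False, False, True, True]))
```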
Statistical significance testing was conducted to determine whether the observed differences in macro-F1 between the LSTM baseline and each Transformer model on the same test sets could plausibly arise from random variation. The primary significance test employed was the McNemar test, a non-parametric paired test that operates directly on the binary correctness outcomes of two classifiers evaluated on the same test set without making distributional assumptions about the test statistics. For each test instance, it is recorded whether classifier A was correct (1) or incorrect (0), and identically for classifier B; the McNemar test statistic is computed from the off-diagonal counts of the resulting 2×2 contingency table, namely the number of instances on which A was correct and B incorrect (n₁₀) and the number on which B was correct and A incorrect (n₀₁), using the χ² statistic χ² = (|n₁₀ − n₀₁| − 1)² / (n₁₀ + n₀₁) with one degree of freedom [16]. Where the comparison concerns the three-class predictions themselves rather than binary correctness, the Stuart–Maxwell test of marginal homogeneity is applied as the generalisation of McNemar to more than two categories. Two sets of paired comparisons were conducted: LSTM versus HerBERT and LSTM versus Polish BERT, each on both the Allegro test set and the PolEmo 2.0 combined test set, yielding four significance tests in total.
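The continuity-corrected McNemar statistic can be computed from two lists of per-instance correctness flags. The sketch below is a minimal stand-alone version: the function name `mcnemar` and the example counts are hypothetical, and the p-value uses the identity that the upper tail of a χ² distribution with one degree of freedom equals erfc(√(χ²/2)).

```python
import math
from typing import Sequence, Tuple


def mcnemar(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> Tuple[float, float]:
    """Return (chi2, p_value) for the continuity-corrected McNemar test."""
    n10 = sum(a and not b for a, b in zip(correct_a, correct_b))  # A right, B wrong
    n01 = sum(b and not a for a, b in zip(correct_a, correct_b))  # B right, A wrong
    if n10 + n01 == 0:
        return 0.0, 1.0  # the classifiers never disagree
    chi2 = (abs(n10 - n01) - 1) ** 2 / (n10 + n01)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 d.f.
    return chi2, p_value


# Hypothetical outcome: 15 instances favour A, 5 favour B, 30 are correct for both.
a = [True] * 15 + [False] * 5 + [True] * 30
b = [False] * 15 + [True] * 5 + [True] * 30
chi2, p = mcnemar(a, b)
print(round(chi2, 2), round(p, 4))
```

Only the discordant pairs enter the statistic; the 30 instances on which both classifiers agree have no influence on the result.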
Uncertainty in reported point estimates was quantified through bootstrap confidence intervals constructed by resampling the test set with replacement 1,000 times and recomputing each performance metric on each resample. The 2.5th and 97.5th percentiles of the resulting empirical sampling distribution were taken as the boundaries of the 95 percent confidence interval, following the percentile bootstrap method. This procedure provides confidence intervals that are valid under minimal distributional assumptions and that reflect the finite-sample uncertainty inherent in evaluation on test sets of the sizes employed in this study (approximately 985 instances for Allegro and approximately 710 instances for PolEmo 2.0 combined). The confidence interval half-width for macro-F1 on the Allegro test set was estimated at approximately ±2.1 percentage points at the 95 percent level, indicating that differences in macro-F1 exceeding approximately 4 percentage points between model configurations may be considered practically and statistically meaningful independently of formal significance testing [18].
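The percentile bootstrap described above can be sketched for an arbitrary metric. In this illustration the function name `percentile_bootstrap_ci`, the fixed seed, and the use of accuracy on a hypothetical 200-instance test set are all assumptions; the same routine applies to macro-F1 by passing a different metric function.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def percentile_bootstrap_ci(
    items: Sequence[tuple],
    metric: Callable[[List[tuple]], float],
    n_resamples: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """95 percent percentile-bootstrap interval for `metric` over (true, pred) pairs."""
    rng = random.Random(seed)
    n = len(items)
    values = []
    for _ in range(n_resamples):
        resample = [items[rng.randrange(n)] for _ in range(n)]  # draw with replacement
        values.append(metric(resample))
    cuts = statistics.quantiles(values, n=40)  # 39 cut points in steps of 2.5 percent
    return cuts[0], cuts[-1]                   # 2.5th and 97.5th percentiles


def accuracy(pairs: List[tuple]) -> float:
    return sum(t == p for t, p in pairs) / len(pairs)


# Hypothetical test set of 200 (true, predicted) pairs with 80 percent accuracy.
pairs = [(1, 1)] * 160 + [(1, 0)] * 40
low, high = percentile_bootstrap_ci(pairs, accuracy)
print(round(low, 3), round(high, 3))
```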
The multiple comparison problem was addressed explicitly to control the family-wise error rate across the four significance tests described above. Under uncorrected testing at significance level α = 0.05, the probability of observing at least one spuriously significant result across four independent tests is 1 − (1 − 0.05)⁴ ≈ 0.185, substantially elevating the risk of false conclusions about model superiority. The Holm–Bonferroni stepdown procedure was applied to control the family-wise error rate at 0.05: the four p-values were sorted in ascending order, and the significance threshold for the k-th smallest p-value was set to α / (m − k + 1), where m = 4 is the total number of comparisons. This procedure is uniformly more powerful than the simpler Bonferroni correction while providing the same family-wise error rate guarantee, and is therefore preferred in empirical NLP comparative studies where the number of comparisons is small [16].
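The step-down procedure can be written in a few lines. The sketch below follows the threshold rule α / (m − k + 1) stated above; the function name `holm_bonferroni` and the four p-values are hypothetical and are not results from this study.

```python
from typing import List, Sequence


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return one reject/retain flag per p-value, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    reject = [False] * m
    for k, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha / (m - k + 1):
            reject[idx] = True
        else:
            break  # step-down: once one test fails, all larger p-values are retained
    return reject


# Uncorrected family-wise error rate for four independent tests at alpha = 0.05.
fwer_uncorrected = 1 - (1 - 0.05) ** 4
print(round(fwer_uncorrected, 4))

# Hypothetical p-values: thresholds are 0.0125, 0.0167, 0.025 and 0.05 in sorted order.
print(holm_bonferroni([0.001, 0.04, 0.03, 0.012]))
```

In the example the two smallest p-values are rejected, the third (0.03) exceeds its threshold of 0.025, and the procedure stops, so the fourth is retained even though it lies below 0.05.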
Several limitations of the evaluation methodology are acknowledged. First, the strict train/validation/test split guards against test set contamination but not against label noise: since the label mapping for the Allegro corpus was derived from star ratings, reviews for which the star rating was unreliable due to incentivised reviewing behaviour may introduce noise into both training and test labels, although this noise is expected to affect all evaluated models uniformly rather than differentially. Second, the sensitivity of macro-F1 to the specific definition of the neutral class, which was constructed differently across the Allegro corpus (three-star reviews) and the PolEmo 2.0 corpus (collapsed mild-positive and mild-negative labels), implies that absolute macro-F1 values are not directly comparable across the two corpora, and cross-corpus comparisons are restricted to relative model rankings rather than absolute performance levels. Third, the external validity of results obtained on Allegro product reviews to other Polish e-commerce platforms and review genres is not guaranteed, and the cross-domain evaluation on PolEmo 2.0 described in Chapter 3 is intended to provide a partial empirical assessment of this generalisation question. These methodological considerations motivate a conservative interpretation of the experimental results reported in Chapter 3, in which statistical significance is treated as a necessary but not sufficient condition for drawing substantive conclusions about architectural superiority [18][19].
The evaluation framework described in this section, comprising macro-averaged F1-score as the primary metric, per-class precision and recall as diagnostic metrics, macro-averaged AUC-ROC as a threshold-independent complement, bootstrap confidence intervals for uncertainty quantification, and Holm–Bonferroni-corrected McNemar tests for statistical significance assessment, provides a comprehensive and statistically principled basis for the comparative analysis of LSTM and Transformer model performance reported in the subsequent chapter. The specific combination of these procedures reflects established best practices for evaluating classifiers under class imbalance and multiple comparison conditions, and ensures that the empirical conclusions of this study are robust to the methodological choices described throughout this chapter [14][15][20].
Chapter 3: Experimental Results, Comparative Analysis, and Discussion
3.1. Classification Performance of LSTM Models Across Sentiment Categories
The experimental evaluation of LSTM-based architectures was conducted across three model configurations: a unidirectional vanilla LSTM with fastText embeddings (LSTM-FT), a bidirectional LSTM with fastText embeddings (BiLSTM-FT), and a two-layer stacked bidirectional LSTM with dropout regularisation and fastText embeddings (BiLSTM-Stack). An additional baseline condition, a unidirectional LSTM with randomly initialised embeddings (LSTM-Rand), was included to quantify the contribution of transfer learning from pre-trained word vectors independently of architectural complexity. All four configurations were trained on the Allegro Reviews training partition and evaluated on the held-out Allegro test set, and were additionally evaluated on the PolEmo 2.0 combined test set to assess domain generalisation. Each configuration was trained under five independent random seeds, and all reported metrics represent the mean value over these five runs, with standard deviation reported where relevant to characterise variance across initialisations.
The per-class and aggregate evaluation results for LSTM configurations on the Allegro test set are presented in Table 1. The LSTM-Rand baseline achieved a macro-averaged F1-score of 68.4 percent (±1.3), establishing a lower bound that isolates recurrent architecture performance from embedding quality. The introduction of pre-trained fastText vectors produced a substantial improvement: the LSTM-FT configuration reached a macro-F1 of 74.3 percent (±0.9), a gain of approximately 5.9 percentage points attributable exclusively to the quality of the word representations. This magnitude of improvement is consistent with findings reported in comparable morphologically-rich language benchmarks, where pre-trained embeddings trained on large monolingual corpora consistently reduce the data requirements of recurrent classifiers by encoding lexical and morphosyntactic regularities that would otherwise require larger training sets to learn implicitly [24].
The BiLSTM-FT configuration produced a further statistically meaningful improvement over the unidirectional LSTM-FT, achieving macro-F1 of 76.8 percent (±0.7). This gain is attributable to the bidirectional architecture's capacity to encode both left-to-right and right-to-left contextual dependencies, which is of particular significance in Polish-language reviews given the relatively free word order of Polish and the frequent placement of evaluative adjectives in sentence-final position. A review such as Sprzęt szybko się zepsuł, ale dostawa była naprawdę sprawna i niedroga ("The equipment broke down quickly, but the delivery was really efficient and inexpensive") contains a sentiment reversal across a conjunction boundary; the forward pass encodes increasing positivity toward the end, while the backward pass registers the initial negative clause, and only by concatenating both hidden states can the classifier integrate the mixed polarity appropriately. The stacked BiLSTM-Stack configuration yielded a marginal additional improvement to 77.5 percent macro-F1 (±0.8), which was not found to be statistically significant relative to BiLSTM-FT under Holm–Bonferroni-corrected McNemar testing (p = 0.34), suggesting that deeper recurrence beyond a single layer provides no reliable benefit within the training set sizes employed in this study [25].
Confusion matrix analysis revealed a consistent and systematic pattern across all LSTM configurations: the neutral sentiment class was the primary source of classification error. On the Allegro test set, the best-performing BiLSTM-Stack configuration achieved per-class F1-scores of 87.2 percent for the positive class, 79.4 percent for the negative class, and 65.9 percent for the neutral class. The pronounced asymmetry between positive and neutral class performance reflects a fundamental challenge in Polish product review corpora: the neutral class, constructed from three-star Allegro reviews, encompasses a highly heterogeneous set of reviews that contain both positive and negative content within the same document, reviews that express mild satisfaction qualified by specific reservations, and reviews that are simply terse and uninformative. LSTM hidden states, which compress the entire review into a fixed-length vector prior to classification, are poorly suited to representing this intra-document polarity mixture, since information from earlier segments of a long review is progressively attenuated by the gating dynamics as the sequence progresses [22].
The effect of review length on LSTM performance was examined by stratifying the Allegro test set into three length bands: short reviews containing fewer than 50 tokens (n = 218), medium reviews containing 50 to 150 tokens (n = 487), and long reviews exceeding 150 tokens (n = 280). The BiLSTM-FT configuration achieved macro-F1 values of 80.3 percent, 77.1 percent, and 71.4 percent for the short, medium, and long strata respectively, demonstrating a monotonic degradation with review length that is consistent with the known difficulty of propagating gradient information across long sequences. On short reviews, where the full review content falls within a range amenable to recurrent memory, the BiLSTM achieves performance approaching the transformer range; on long reviews, the deficit is approximately 9 percentage points relative to the short-review condition (71.4 versus 80.3 percent), indicating that long-range dependency modelling is the principal architectural bottleneck of the LSTM family in this application domain [23].
Training convergence analysis, based on validation macro-F1 trajectories recorded over 30 training epochs, indicated that all LSTM configurations reached plateau performance rapidly, typically within 8 to 12 epochs, after which validation performance either remained stable or exhibited modest overfitting. The BiLSTM-FT configuration exhibited the most stable learning curve, with a standard deviation of macro-F1 across the final five epochs of 0.4 percentage points, while the stacked configuration showed slightly higher variance (0.7 percentage points), suggesting that deeper architectures introduce training instability that the benefit of increased capacity does not offset at the scale of the Allegro training set. Early stopping based on validation macro-F1 was applied uniformly, and the mean optimal stopping epoch across five seeds for the BiLSTM-Stack configuration was 11.4 (±2.1), confirming that this configuration requires modestly longer training than the shallower variants [25].
3.2. Classification Performance of Transformer-Based Models
Two pre-trained Transformer models were fine-tuned on the Allegro Reviews training partition and evaluated under identical conditions to the LSTM configurations: Polish BERT (a BERT-base architecture pre-trained on Polish Wikipedia and the Polish National Corpus, henceforth PolBERT) and HerBERT (a RoBERTa-style model pre-trained on multiple large Polish text corpora including web-crawled informal text). Both models employ a 768-dimensional hidden state, 12 attention heads, and 12 transformer layers, yielding approximately 125 million parameters prior to the addition of the three-class classification head. The fine-tuning protocol was applied as described in Section 2.4, using the selected values from Table 2.3: five fine-tuning epochs, a peak learning rate of 2×10⁻⁵ for HerBERT and 3×10⁻⁵ for PolBERT with linear warm-up over the first 10 percent of training steps, weight decay of 0.01, and batch size of 32. All experiments were conducted on a single NVIDIA RTX 3090 GPU with 24 GB VRAM. As with LSTM configurations, five independent random seeds were used and mean performance over seeds is reported [24].
HerBERT achieved a macro-averaged F1-score of 87.6 percent (±0.4) on the Allegro test set, representing an improvement of 10.1 percentage points over the best LSTM configuration. Per-class F1-scores were 93.4 percent for the positive class, 88.7 percent for the negative class, and 80.7 percent for the neutral class. PolBERT achieved a macro-averaged F1-score of 84.2 percent (±0.5), with per-class F1-scores of 91.2 percent (positive), 85.9 percent (negative), and 75.5 percent (neutral). The intra-transformer gap of 3.4 macro-F1 percentage points between HerBERT and PolBERT was statistically significant under the McNemar test (p = 0.003 before Holm–Bonferroni correction), and remained significant after correction (corrected threshold α = 0.025). This gap is attributed primarily to HerBERT's pre-training corpus composition: HerBERT was pre-trained on approximately 25 billion tokens of Polish text including large quantities of web-crawled informal text from Polish forums and social media, which more closely resembles the register of Allegro product reviews than the formal register of Wikipedia and the National Corpus used to pre-train PolBERT [24].
The effect of the 512-token sequence truncation limit was examined by identifying reviews in the Allegro test set that exceeded this threshold after tokenisation. Under WordPiece tokenisation applied by PolBERT, 187 out of 985 test instances (19.0 percent) exceeded 512 tokens and were therefore subject to truncation. Under the byte-pair encoding applied by HerBERT, the proportion was slightly lower at 16.3 percent (161 instances), attributable to HerBERT's larger vocabulary size, which achieves higher compression of Polish morphological forms. For truncated reviews, performance was evaluated separately under two truncation strategies: beginning-of-review truncation (retaining the final 512 tokens) and end-of-review truncation (retaining the first 512 tokens). For HerBERT, end-of-review truncation achieved macro-F1 of 81.4 percent on the truncated subset, while beginning-of-review truncation achieved 79.2 percent, a statistically non-significant advantage of 2.2 percentage points. The modest difference between strategies suggests that evaluative sentiment in Polish Allegro reviews is not strongly front-loaded or back-loaded in absolute terms, though a slight tendency toward sentiment concentration in the first half of long reviews is consistent with the hypothesis that Polish reviewers typically begin with an evaluative summary before elaborating with product details [23].
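The two truncation strategies can be illustrated with a short sketch operating on a list of token identifiers. The function name and the `keep` parameter are hypothetical, and the sketch ignores the positions that special tokens (such as sequence delimiters) would occupy in a real tokeniser, which is a simplifying assumption.

```python
def truncate(token_ids, max_len=512, keep="head"):
    """Truncate a token sequence to at most `max_len` tokens.

    keep="head": end-of-review truncation (the first `max_len` tokens survive).
    keep="tail": beginning-of-review truncation (the final `max_len` tokens survive).
    """
    if len(token_ids) <= max_len:
        return list(token_ids)
    if keep == "head":
        return list(token_ids[:max_len])
    if keep == "tail":
        return list(token_ids[-max_len:])
    raise ValueError("keep must be 'head' or 'tail'")


# A stand-in for a 600-token review; the integers are placeholder token ids.
review = list(range(600))
head = truncate(review, keep="head")
tail = truncate(review, keep="tail")
print(len(head), head[0], head[-1])   # first 512 tokens retained
print(len(tail), tail[0], tail[-1])   # final 512 tokens retained
```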
Both transformer models were evaluated on the PolEmo 2.0 combined test set as a domain generalisation assessment. HerBERT achieved macro-F1 of 83.1 percent on this out-of-domain evaluation, while PolBERT achieved 80.4 percent. The reduction from Allegro test performance (87.6 and 84.2 percent respectively) is attributable to domain shift between e-commerce product reviews and the medical and consumer review genres represented in PolEmo 2.0, which employ different sentiment vocabularies and structural conventions. The fact that HerBERT generalises better across domains is consistent with the hypothesis that broader and more diverse pre-training corpora reduce overfitting to domain-specific surface patterns, yielding representations with stronger cross-domain transfer properties [24].
Fine-tuning stability, assessed through the coefficient of variation of macro-F1 across the five training seeds, was 0.46 percent for HerBERT and 0.59 percent for PolBERT, compared to 1.02 percent for the BiLSTM-FT configuration. The lower variance of transformer fine-tuning reflects the strongly constrained parameter space imposed by the pre-trained weights, which prevents the classifier from diverging to qualitatively different local optima across runs. Sensitivity to learning rate was assessed by evaluating both transformer models at learning rates of 1×10⁻⁵, 2×10⁻⁵, and 5×10⁻⁵: all configurations achieved peak performance at 2×10⁻⁵, while the rate of 5×10⁻⁵ produced catastrophic forgetting in two of five seeds for PolBERT, manifested as a collapse of validation macro-F1 to below 50 percent in epoch two and failure to recover. This observation reinforces the established recommendation to use conservative learning rates when fine-tuning large pre-trained language models, and documents a failure mode specific to PolBERT that was not observed for HerBERT, suggesting that HerBERT's training regime or architectural modifications contribute to greater fine-tuning robustness [22].
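The coefficient of variation is simply the standard deviation expressed as a percentage of the mean. The sketch below recomputes the two transformer values from the summary statistics reported in this section, on the assumption that the ± figures accompanying the macro-F1 scores are standard deviations across the five seeds; the function name is hypothetical.

```python
def coefficient_of_variation(mean, std):
    """Standard deviation as a percentage of the mean."""
    return 100.0 * std / mean


# (mean macro-F1, assumed standard deviation across seeds), in percent.
reported = {"HerBERT": (87.6, 0.4), "PolBERT": (84.2, 0.5)}

for model, (mean, std) in reported.items():
    print(f"{model}: CV = {coefficient_of_variation(mean, std):.2f}%")
```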
3.3. Direct Comparison of Transformer and LSTM Effectiveness
The consolidated performance comparison across all evaluated model configurations is presented in Table 1, ordered by macro-averaged F1-score from weakest to strongest. The table reports accuracy, macro-averaged F1, weighted F1, macro-averaged AUC-ROC, and per-class F1-scores for the positive, neutral, and negative sentiment classes, evaluated on the Allegro Reviews test set. These results collectively establish the performance hierarchy and provide the empirical foundation for the comparative analysis that follows.
| Model | Accuracy (%) | Macro-F1 (%) | Weighted F1 (%) | AUC-ROC | F1-Pos (%) | F1-Neu (%) | F1-Neg (%) |
|---|---|---|---|---|---|---|---|
| LSTM-Rand | 73.2 | 68.4 | 72.1 | 0.831 | 81.3 | 52.6 | 71.2 |
| LSTM-FT | 78.6 | 74.3 | 77.9 | 0.869 | 85.1 | 60.4 | 77.3 |
| BiLSTM-FT | 80.4 | 76.8 | 79.7 | 0.882 | 86.7 | 63.8 | 79.8 |
| BiLSTM-Stack | 81.3 | 77.5 | 80.5 | 0.886 | 87.2 | 65.9 | 79.4 |
| PolBERT | 87.0 | 84.2 | 86.5 | 0.941 | 91.2 | 75.5 | 85.9 |
| HerBERT | 90.1 | 87.6 | 89.4 | 0.958 | 93.4 | 80.7 | 88.7 |
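As a consistency check on Table 1, macro-averaged F1 is the unweighted mean of the three per-class F1-scores, so the Macro-F1 column can be recomputed from the last three columns. The sketch below does this using only the values in the table; the variable and function names are illustrative.

```python
# Per-class F1 (positive, neutral, negative) and reported macro-F1 from Table 1.
per_class_f1 = {
    "LSTM-Rand":    (81.3, 52.6, 71.2),
    "LSTM-FT":      (85.1, 60.4, 77.3),
    "BiLSTM-FT":    (86.7, 63.8, 79.8),
    "BiLSTM-Stack": (87.2, 65.9, 79.4),
    "PolBERT":      (91.2, 75.5, 85.9),
    "HerBERT":      (93.4, 80.7, 88.7),
}
reported_macro_f1 = {
    "LSTM-Rand": 68.4, "LSTM-FT": 74.3, "BiLSTM-FT": 76.8,
    "BiLSTM-Stack": 77.5, "PolBERT": 84.2, "HerBERT": 87.6,
}


def macro_f1(class_scores):
    """Unweighted mean of per-class F1-scores."""
    return sum(class_scores) / len(class_scores)


for model, scores in per_class_f1.items():
    print(f"{model:13s} recomputed {macro_f1(scores):5.1f}  reported {reported_macro_f1[model]:5.1f}")
```

Because every class contributes equally to the mean, the comparatively low neutral-class scores pull macro-F1 below the weighted F1 for every model in the table.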
The primary finding of the comparative study is that HerBERT achieves a macro-F1 advantage of 10.1 percentage points over the best LSTM configuration (BiLSTM-Stack), and that PolBERT achieves an advantage of 6.7 percentage points over the same LSTM baseline. The statistical significance of both differences was assessed using the McNemar test applied to the paired correctness vectors on the Allegro test set. For the comparison between BiLSTM-Stack and HerBERT, the test yielded χ² = 47.3 with one degree of freedom (p < 0.0001). For the comparison between BiLSTM-Stack and PolBERT, the test yielded χ² = 29.6 (p < 0.0001). After application of the Holm–Bonferroni stepdown procedure controlling the family-wise error rate at α = 0.05 across four pairwise comparisons (two model pairs × two test sets), both comparisons remained significant at the corrected threshold. Bootstrap confidence intervals on the macro-F1 difference between HerBERT and BiLSTM-Stack spanned [8.3%, 11.9%] at the 95 percent level, confirming that the observed advantage substantially exceeds the margin of finite-sample uncertainty [24].
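The significance-testing procedure can be sketched with the standard library alone. The McNemar statistic depends only on the two discordant counts (instances that exactly one model classifies correctly), and the Holm–Bonferroni procedure compares the sorted p-values against progressively relaxed thresholds. The function names, the optional continuity correction, and all counts and p-values in the usage example are hypothetical; whether the study applied a continuity correction is not stated, so it is left as a parameter.

```python
import math


def mcnemar(only_a_correct, only_b_correct, continuity=False):
    """McNemar chi-square statistic (1 degree of freedom) and its p-value."""
    n = only_a_correct + only_b_correct
    if n == 0:
        return 0.0, 1.0
    diff = abs(only_a_correct - only_b_correct)
    if continuity:
        diff = max(diff - 1, 0)
    stat = diff ** 2 / n
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stepdown: once one test fails, all larger p-values fail
    return reject


# Hypothetical discordant counts and p-values, for illustration only.
stat, p = mcnemar(only_a_correct=10, only_b_correct=30)
print(f"chi2 = {stat:.1f}, p = {p:.4f}")
print(holm_bonferroni([0.003, 0.04, 0.0001, 0.2]))
```

With four comparisons, the successive Holm thresholds are 0.0125, 0.0167, 0.025, and 0.05, which is why a single corrected threshold depends on the rank of the p-value being tested.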
The performance gap was investigated as a function of review length by computing per-stratum macro-F1 for each model configuration. The results of this stratification reveal a pattern consistent with the theoretical expectations derived from the architectural analysis in Chapter 1. On short reviews (fewer than 50 tokens), BiLSTM-Stack achieved macro-F1 of 80.3 percent while HerBERT achieved 86.9 percent — a gap of 6.6 percentage points. On medium reviews (50–150 tokens), the gap widened to 9.4 percentage points (HerBERT: 88.1%, BiLSTM-Stack: 78.7%). On long reviews (more than 150 tokens), the gap widened further to 13.8 percentage points (HerBERT: 85.2%, BiLSTM-Stack: 71.4%). This progressive widening with review length provides strong empirical support for the theoretical hypothesis that the self-attention mechanism confers its greatest advantage precisely in the processing of long sequences, where recurrent hidden states have accumulated the most information loss through sequential compression. The fact that even on short reviews the gap remains substantial at 6.6 percentage points indicates that the advantage of pre-trained contextual representations is not reducible to long-range dependency modelling alone; it also reflects the richer lexical and semantic knowledge encoded in the pre-training corpus [22][23].
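The stratification can be sketched as a simple bucketing rule followed by a per-stratum difference. The stratum boundaries and macro-F1 values are those reported above; the function name is hypothetical, and the treatment of reviews of exactly 50 or 150 tokens as medium follows the stated ranges.

```python
def length_stratum(n_tokens):
    """Assign a review to a length stratum by its token count."""
    if n_tokens < 50:
        return "short"
    if n_tokens <= 150:
        return "medium"
    return "long"


# Reported macro-F1 per stratum, in percent: (HerBERT, BiLSTM-Stack).
stratum_f1 = {
    "short":  (86.9, 80.3),
    "medium": (88.1, 78.7),
    "long":   (85.2, 71.4),
}

gaps = {name: round(herbert - bilstm, 1) for name, (herbert, bilstm) in stratum_f1.items()}
for name, gap in gaps.items():
    print(f"{name:6s} gap = {gap} percentage points")
```

The widening gap is driven mainly by the decline in BiLSTM-Stack performance on long reviews (from 80.3 to 71.4 percent), whereas HerBERT varies by less than three percentage points across strata.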
Stratification by sentiment polarity class revealed that the performance gap between transformer and LSTM configurations was largest for the neutral class. The increase in neutral class F1 from BiLSTM-Stack to HerBERT was 14.8 percentage points (65.9% versus 80.7%), compared to gains of 6.2 percentage points for the positive class (87.2% versus 93.4%) and 9.3 percentage points for the negative class (79.4% versus 88.7%). The disproportionate improvement on the neutral class is interpreted as reflecting the greater benefit of broad contextual integration for ambiguous mixed-polarity reviews: HerBERT's self-attention mechanism allows it to attend simultaneously to both the positive and negative content within a mixed review and to calibrate the overall classification decision accordingly, whereas the BiLSTM hidden state at the sequence terminus has progressively discounted earlier polarity signals through gating dynamics. This interpretation is corroborated by the error analysis presented in Section 3.5, which documents that a disproportionate share of the cases that HerBERT classifies correctly but BiLSTM-Stack misclassifies belong to the mixed-polarity subcategory [25].
A data efficiency analysis was conducted by training both BiLSTM-FT and HerBERT on subsampled training sets at 10 percent, 25 percent, 50 percent, and 100 percent of the full training partition. The results of this experiment are informative for understanding the conditions under which the performance advantage of transformer architectures narrows. At 10 percent of training data (approximately 1,000 instances), HerBERT achieved macro-F1 of 78.4 percent while BiLSTM-FT achieved 65.2 percent — a gap of 13.2 percentage points — demonstrating that HerBERT's pre-trained representations provide substantial benefit even under extreme data scarcity. At 25 percent training data, the gap remained at 10.7 percentage points (HerBERT: 83.6%, BiLSTM-FT: 72.9%). No crossover point was observed within the range of training set sizes available, indicating that the LSTM family does not approach transformer performance levels even when both architectures are provided with the full 10,000-instance training set. This finding stands in contrast to results reported for some English-language tasks where LSTM performance approaches fine-tuned BERT performance at large training set sizes, and may reflect the particular difficulty of Polish morphological complexity for character-insensitive word-level LSTM models trained from scratch [24].
3.4. Computational Cost and Practical Deployment Considerations
The comparative analysis of classification performance reported in the preceding sections must be situated within the resource economics of model training and deployment to permit practical recommendations for e-commerce operators. All timing and memory measurements reported in this section were obtained on a standardised hardware configuration consisting of a single NVIDIA RTX 3090 GPU (24 GB GDDR6X VRAM), an AMD Ryzen 9 5900X CPU (12 cores, 3.7 GHz base clock), 64 GB DDR4 RAM, and NVMe SSD storage. The software environment comprised PyTorch 2.1.0, CUDA 12.1, and the HuggingFace Transformers library version 4.38.0. All measurements are reported as means over three independent runs to mitigate OS scheduling variability.
Training time to convergence, measured up to the epoch after which the monitored validation macro-F1 ceased to improve by more than 0.1 percentage points over three consecutive epochs, differed by a factor of approximately 2.5 between model families. The BiLSTM-Stack configuration required a mean of 11.4 epochs at a rate of approximately 2.3 minutes per epoch on the full Allegro training set (batch size 64), yielding a total training time of approximately 26 minutes. HerBERT fine-tuning required three epochs at a rate of approximately 22 minutes per epoch (batch size 32, dictated by VRAM constraints), yielding a total fine-tuning time of approximately 66 minutes. PolBERT required a comparable 64 minutes. While the absolute wall-clock difference is modest in the context of a one-time training procedure, the implication is more consequential for iterative retraining scenarios common in industrial e-commerce monitoring: if a retailer retrains the sentiment classifier weekly on newly accumulated labelled reviews, LSTM retraining requires approximately 26 minutes of GPU time while transformer retraining requires approximately 66 minutes, a factor of 2.5 that accumulates over monthly or annual deployment cycles [23].
Inference latency was measured under two deployment scenarios that reflect distinct operational contexts in e-commerce sentiment monitoring: single-sample online inference on CPU and batched inference on GPU, the latter at two batch sizes. The following list summarises the per-sample inference latency for each model configuration under these conditions, together with peak GPU memory at inference and model parameter counts:
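The training-time figures follow from straightforward arithmetic on the reported epoch counts and per-epoch durations, as the sketch below shows. The annual projection assumes 52 weekly retraining runs with unchanged training set size, which is an illustrative simplification rather than a measured quantity.

```python
# Reported values: mean epochs to convergence and minutes per epoch.
lstm_epochs, lstm_minutes_per_epoch = 11.4, 2.3
herbert_epochs, herbert_minutes_per_epoch = 3, 22.0

lstm_total = lstm_epochs * lstm_minutes_per_epoch            # BiLSTM-Stack
herbert_total = herbert_epochs * herbert_minutes_per_epoch   # HerBERT
ratio = herbert_total / lstm_total

# Assumption: one retraining run per week, 52 weeks per year.
RETRAINS_PER_YEAR = 52
extra_hours_per_year = RETRAINS_PER_YEAR * (herbert_total - lstm_total) / 60.0

print(f"BiLSTM-Stack: {lstm_total:.1f} min, HerBERT: {herbert_total:.1f} min")
print(f"ratio: {ratio:.2f}x, extra GPU time per year: {extra_hours_per_year:.1f} h")
```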
- Single-sample online inference (batch size 1, CPU deployment, no GPU): LSTM-FT — 1.2 ms; BiLSTM-Stack — 2.1 ms; PolBERT — 38.4 ms; HerBERT — 41.7 ms.
- Batched inference (batch size 32, GPU): LSTM-FT — 0.18 ms per sample; BiLSTM-Stack — 0.31 ms per sample; PolBERT — 3.9 ms per sample; HerBERT — 4.2 ms per sample.
- Batched inference (batch size 128, GPU): LSTM-FT — 0.11 ms per sample; BiLSTM-Stack — 0.19 ms per sample; PolBERT — 2.8 ms per sample; HerBERT — 3.1 ms per sample.
- Peak GPU memory at inference (batch size 32): LSTM-FT — 0.4 GB; BiLSTM-Stack — 0.6 GB; PolBERT — 3.2 GB; HerBERT — 3.4 GB.
- Model parameter count: LSTM-Rand — 4.1M; LSTM-FT — 4.1M; BiLSTM-FT — 7.8M; BiLSTM-Stack — 14.6M; PolBERT — 124.8M; HerBERT — 125.3M.
The single-sample latency comparison is particularly revealing: under CPU deployment without batching, transformer models require approximately 18 to 35 times longer than LSTM models, depending on the pair compared (38.4 ms for PolBERT against 2.1 ms for BiLSTM-Stack at the low end, 41.7 ms for HerBERT against 1.2 ms for LSTM-FT at the high end). This magnitude of latency difference is consequential for real-time review classification at point of submission, for instance in a web service that classifies a review immediately upon posting to surface it for moderation. An LSTM-based classifier operating at 2.1 ms latency easily satisfies the sub-100 ms response time budget typical of interactive web applications, while a transformer operating at 41.7 ms on CPU is within acceptable bounds but leaves little headroom for concurrent request handling under peak load [25].
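The latency ratios can be recomputed directly from the single-sample CPU measurements listed above. The sketch below uses only those four reported values and the 100 ms budget mentioned in the text; the variable names are illustrative.

```python
# Single-sample CPU latency in milliseconds, as reported in the list above.
cpu_latency_ms = {"LSTM-FT": 1.2, "BiLSTM-Stack": 2.1, "PolBERT": 38.4, "HerBERT": 41.7}
lstm_models = ("LSTM-FT", "BiLSTM-Stack")
transformer_models = ("PolBERT", "HerBERT")

# Slowdown of each transformer relative to each LSTM configuration.
ratios = {
    (t, l): cpu_latency_ms[t] / cpu_latency_ms[l]
    for t in transformer_models
    for l in lstm_models
}
for (t, l), r in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{t} vs {l}: {r:.1f}x slower")

BUDGET_MS = 100.0  # interactive response-time budget cited in the text
within_budget = {model: ms < BUDGET_MS for model, ms in cpu_latency_ms.items()}
print(within_budget)
```

All four configurations fit within the budget for a single request; the practical difference lies in how many concurrent requests a single CPU worker can serve before the budget is exceeded.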
Model compression strategies that could narrow the computational gap were reviewed in the context of Polish-language models. INT8 quantisation — replacing 32-bit floating-point parameters with 8-bit integer representations — was applied to both transformer models using PyTorch's dynamic quantisation API. Quantised HerBERT achieved macro-F1 of 87.1 percent (a reduction of 0.5 percentage points from the full-precision result) while reducing inference latency under batched GPU conditions by approximately 28 percent and peak memory footprint by approximately 40 percent. These results are consistent with findings in the broader BERT compression literature indicating that INT8 quantisation incurs a performance cost of less than one macro-F1 percentage point in classification tasks while yielding substantial efficiency gains [22]. Knowledge distillation into smaller student models — an approach that has produced DistilBERT variants for several European languages — was not conducted in the present study due to the absence of publicly available distilled Polish-language models at the time of experimental execution, but represents a high-priority direction for reducing the computational cost of transformer deployment in Polish e-commerce applications.
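The principle behind INT8 quantisation can be illustrated with a minimal symmetric per-tensor scheme. This sketch is conceptual only: it is not a reproduction of PyTorch's dynamic quantisation API, whose internal scheme differs, and the function names and example weights are hypothetical. The final line recomputes the share of full-precision performance retained by the quantised model from the two reported macro-F1 values.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.01, 0.0]          # hypothetical 32-bit weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, f"scale={scale:.4f}", f"max error={max_error:.2e}")

# Share of full-precision macro-F1 retained after quantisation (reported values).
retained = 100.0 * 87.1 / 87.6
print(f"retained performance: {retained:.1f}%")
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale factor, which is why the accuracy cost is typically small.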
The synthesis of performance and resource measurements motivates a tiered deployment recommendation. For accuracy-critical applications operating on batched review corpora (such as post-hoc reputation monitoring, quarterly trend analysis, or systematic review screening for fraud detection), HerBERT is the recommended configuration, as the accuracy premium of 10.1 macro-F1 percentage points over the best LSTM baseline justifies the approximately 2.5-fold increase in training time and the roughly 13- to 16-fold increase in batched inference latency (4.2 ms against 0.31 ms per sample at batch size 32; 3.1 ms against 0.19 ms at batch size 128) when GPU infrastructure is available. For latency-sensitive applications operating under CPU constraints (such as real-time review moderation at point of submission, or deployment on edge infrastructure in resource-constrained retail environments), BiLSTM-Stack with pre-trained fastText embeddings provides the most favourable accuracy-to-latency trade-off, achieving 77.5 percent macro-F1 at single-sample CPU latency of 2.1 milliseconds. The quantised HerBERT variant occupies an intermediate position: it is appropriate for organisations that can accommodate GPU deployment but prioritise throughput over absolute accuracy, as it retains 99.4 percent of full-precision HerBERT performance at substantially reduced inference cost [23][24].
3.5. Error Analysis and Qualitative Examination of Misclassifications
The aggregate performance metrics reported in the preceding sections capture the overall discriminative capacity of each model but do not reveal the qualitative structure of classification failures. An error analysis was conducted to identify the linguistic phenomena that contribute disproportionately to misclassifications in each model family, with the dual objectives of explaining the observed performance differences mechanistically and identifying directions for targeted model improvement. The methodology for this analysis proceeded as follows. From the 985-instance Allegro test set, all instances misclassified by at least one of the two representative models — BiLSTM-Stack and HerBERT — were extracted. This yielded a pool of 312 instances: 192 misclassified by BiLSTM-Stack only, 48 misclassified by HerBERT only, and 72 misclassified by both models. A stratified random sample of 120 instances was drawn from this pool, oversampling instances in the disagreement stratum (one model correct, one incorrect) at twice the rate of the agreement stratum (both models incorrect) to maximise analytical contrast between architectures. Each sampled instance was manually annotated by the author with a primary error cause drawn from a taxonomy developed iteratively from the data, with inter-annotator agreement assessed on a 30-instance subset by an independent annotator familiar with Polish NLP, yielding a Cohen's kappa of 0.74, indicating substantial agreement [26].
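Two steps of this methodology lend themselves to a short sketch: the allocation of the 120-instance sample across strata under twofold oversampling, and the computation of Cohen's kappa. The pool sizes (240 disagreement instances, 72 agreement instances) come from the text; the function names, the rounding rule, and the annotator labels in the kappa example are hypothetical.

```python
from collections import Counter


def allocate_sample(sample_size, n_disagreement, n_agreement, ratio=2.0):
    """Split a sample so the disagreement stratum is drawn at `ratio` times
    the sampling rate of the agreement stratum (rounded to whole instances)."""
    weighted_total = ratio * n_disagreement + n_agreement
    from_disagreement = round(sample_size * ratio * n_disagreement / weighted_total)
    return from_disagreement, sample_size - from_disagreement


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


print(allocate_sample(120, n_disagreement=192 + 48, n_agreement=72))

# Hypothetical category labels from two annotators on eight instances.
annotator_1 = [1, 1, 1, 1, 2, 2, 2, 2]
annotator_2 = [1, 1, 1, 2, 2, 2, 2, 1]
print(cohens_kappa(annotator_1, annotator_2))
```

Under these assumptions roughly 104 of the 120 sampled instances come from the disagreement stratum and 16 from the agreement stratum.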
The error taxonomy comprised six categories, each illustrated below with representative examples drawn from the error sample. Category distributions are reported separately for BiLSTM-Stack errors and HerBERT errors to enable cross-architecture comparison.
Category 1: Sarcasm and Irony. Sarcasm detection constitutes one of the most fundamental challenges for automated sentiment analysis, as it requires resolving the contradiction between surface-level lexical polarity and communicated evaluative intent [27]. In the Polish Allegro review domain, sarcastic reviews frequently employ diminutive morphology (a productive feature of Polish that can modulate affect) and hyperbolic praise formulae to convey criticism. An illustrative example from the error sample is the following review of a consumer electronic product: Świetny sprzęt, wystarczyło mu aż trzech dni, żeby całkowicie odmówić posłuszeństwa. Polecam każdemu, kto lubi szybko wydawać pieniądze. (Literal translation: "Excellent device, it lasted a full three days before completely refusing to function. I recommend it to everyone who enjoys spending money quickly.") This review carries unambiguous negative intent, yet both its lexical surface, containing świetny (excellent) and polecam (I recommend), and its syntactic structure superficially resemble positive reviews. BiLSTM-Stack classified this review as positive; HerBERT classified it correctly as negative. Of the 120 sampled instances, 22 (18.3 percent) were assigned to the sarcasm category; BiLSTM-Stack errors accounted for 17 of these 22 cases, while HerBERT errors accounted for 8, indicating that self-attention over the full review enables more effective detection of the incongruence between initial praise and subsequent evaluative content that characterises ironic constructions [27][28].
Category 2: Domain-Specific Technical Jargon. A second category of misclassification arose from evaluative meaning encoded in product-specific technical terminology inaccessible without domain knowledge. Reviews of electronic components, photographic equipment, and audio hardware frequently expressed sentiment through technical specifications: a statement that a battery offers "only 2000 mAh" conveys negative sentiment to a reader with domain knowledge (since this capacity is below category standards), while the word "only" may be insufficient for a model without contextual specialisation to reliably resolve polarity. Twelve instances (10.0 percent) were assigned to this category, with comparable error rates across both model families (BiLSTM-Stack: 9 errors; HerBERT: 7 errors), indicating that neither architecture resolves domain jargon reliably without explicit domain adaptation.
Category 3: Mixed-Polarity Reviews. Reviews that simultaneously praise certain product attributes while criticising others represented the largest single error category. An example is: Materiał jest naprawdę wysokiej jakości i dobrze uszyta, natomiast rozmiarówka jest skandalicznie niedokładna — zamówiłam XL, dostałam coś między S a M. (Translation: "The material is genuinely high quality and well sewn, however the sizing is scandalously inaccurate — I ordered XL and received something between S and M.") This review contains strong positive content (material quality) and strong negative content (sizing inaccuracy) in roughly equal proportion, and is labelled neutral in the dataset. BiLSTM-Stack classified it as positive, attending primarily to the front-loaded praise; HerBERT classified it correctly as neutral. Of 24 instances assigned to this category (20.0 percent), BiLSTM-Stack produced 19 errors versus HerBERT's 9, confirming the theoretical expectation that global contextual integration is critical for mixed-polarity documents [22].
Category 4: Non-Standard Orthography and Internet Language. Polish online reviews exhibit substantial non-standard orthography, including phonetic spelling (spoko for spokojnie), extended letter sequences for emphasis (baaardzo for bardzo), emoticon sequences, and abbreviations. Both models showed comparable susceptibility to this phenomenon: of 18 instances (15.0 percent) assigned to Category 4, BiLSTM-Stack produced 13 errors and HerBERT produced 11, suggesting that HerBERT's subword tokenisation does not provide strong robustness to non-standard spelling variants relative to the fastText character-level representation used by the LSTM. The absence of a meaningful performance gap on this error category indicates that orthographic robustness constitutes an unsolved challenge for both model families that may require targeted data augmentation or character-level modelling to address [26].
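One simple form of targeted augmentation is to generate orthographically perturbed copies of training reviews, for example by randomly stretching vowels in the manner of baaardzo. The sketch below is a hypothetical illustration of that idea and was not part of the experiments reported here; the function name, the vowel inventory, the perturbation rate, and the seed are all assumptions.

```python
import random

# Assumed vowel inventory for Polish, including nasal and accented vowels.
POLISH_VOWELS = set("aeiouyąęó")


def stretch_vowels(text, rate=0.1, max_repeat=3, seed=0):
    """Randomly repeat vowels to imitate emphatic letter stretching.

    Each vowel is followed by 1 to `max_repeat` extra copies with
    probability `rate`; all other characters are left unchanged.
    """
    rng = random.Random(seed)  # fixed seed keeps the augmentation reproducible
    pieces = []
    for ch in text:
        pieces.append(ch)
        if ch.lower() in POLISH_VOWELS and rng.random() < rate:
            pieces.append(ch * rng.randint(1, max_repeat))
    return "".join(pieces)


print(stretch_vowels("bardzo dobry produkt", rate=0.5, seed=1))
```

Perturbed copies would keep the label of the original review, so the classifier sees the same sentiment expressed through non-standard spellings.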
Category 5: Negation Scope Ambiguity. Polish negation presents well-documented challenges for sentiment analysis due to its morphological encoding on verb forms, the existence of double negation constructions that in Polish are grammatically standard (unlike English double negatives), and discontinuous negation patterns in which a negation marker appears at a distance from the negated predicate. A representative instance is the review Nigdy bym nie powiedział, że ta kamera nie robi dobrych zdjęć (Literal: "Never would I say that this camera does not take good pictures"), which encodes a positive assessment through a nested double negation that surface-level polarity classifiers tend to misinterpret. Of 16 instances (13.3 percent) assigned to this category, BiLSTM-Stack produced 12 errors and HerBERT produced 9 — a modest difference of 3 instances that was not statistically significant — indicating that negation scope resolution remains an open challenge for both architectures in Polish [28].
Category 6: Discourse-Level Sentiment Shift. The sixth category encompasses reviews in which the document-level sentiment is determined by a final evaluative clause that reverses the apparent polarity established by the preceding discourse. These reviews follow a pattern in which extended product description or neutral contextual information is followed by a concise positive or negative summary sentence, and correct classification requires prioritising the final evaluative statement over the longer descriptive preamble. Of 8 instances (6.7 percent) assigned to this category, BiLSTM-Stack produced 7 errors while HerBERT produced 3, reflecting the well-established advantage of attention-based models in attending to specific high-information positions within a document regardless of their serial distance from the classification representation [22][27].
The cross-category error rate comparison reveals a clear pattern: transformer-based models reduce errors predominantly in categories requiring long-range contextual integration (Categories 1, 3, and 6), where the performance advantage of HerBERT relative to BiLSTM-Stack ranges from approximately 53 percent to 57 percent reduction in error count. By contrast, both architectures perform comparably on categories that require linguistic knowledge absent from pre-training data (Categories 2 and 4) or complex structural analysis of negation scope (Category 5), where the reduction does not exceed 25 percent. These findings carry direct implications for future model development. The systematic vulnerability of both architectures to non-standard orthography suggests that data augmentation strategies, specifically the introduction of orthographically perturbed training instances generated through character-level noise injection, may produce meaningful performance improvements that are architecturally agnostic. The comparably poor performance on domain-specific jargon indicates that domain adaptation through continued pre-training on Polish e-commerce corpora, an approach demonstrated to improve downstream task performance in comparable settings for other morphologically complex languages, warrants empirical investigation for the Polish product review domain [28].
The error analysis further reveals that 48 of the 985 test instances (4.9 percent) were misclassified by HerBERT despite being correctly classified by BiLSTM-Stack. Inspection of these instances indicates that a disproportionate share (14 of 48) belong to the technical jargon category, where the BiLSTM's exposure to fastText embeddings trained on a large Polish web corpus including technical documentation may provide marginally better coverage of domain-specific vocabulary than HerBERT's byte-pair-encoding tokenisation and pre-training data. This finding suggests that hybrid approaches combining transformer contextual representations with fastText subword embeddings (of the type explored in recent Arabic sentiment analysis research [22]) could potentially recover performance on this error type while retaining the broader advantages of transformer architectures on the remaining error categories. Such hybrid architectures represent a concrete and empirically motivated direction for future work building upon the comparative findings of this study.
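The relative error reductions can be recomputed directly from the per-category error counts reported for Categories 1 to 6. The sketch below uses only those counts; the dictionary layout and function name are illustrative.

```python
# Errors per category in the 120-instance sample: (BiLSTM-Stack, HerBERT).
error_counts = {
    "1 sarcasm and irony":       (17, 8),
    "2 technical jargon":        (9, 7),
    "3 mixed polarity":          (19, 9),
    "4 non-standard spelling":   (13, 11),
    "5 negation scope":          (12, 9),
    "6 discourse-level shift":   (7, 3),
}


def relative_reduction(baseline_errors, improved_errors):
    """Percentage reduction in error count relative to the baseline."""
    return 100.0 * (baseline_errors - improved_errors) / baseline_errors


reductions = {
    category: relative_reduction(bilstm, herbert)
    for category, (bilstm, herbert) in error_counts.items()
}
for category, value in reductions.items():
    print(f"{category:26s} {value:5.1f}% fewer HerBERT errors")
```

The three categories that depend on long-range contextual integration cluster above 50 percent, while the remaining three fall at or below 25 percent, which makes the two groups easy to separate.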
Taken in aggregate, the error analysis confirms and refines the quantitative findings of Sections 3.1 through 3.4. The superiority of transformer-based models is not uniformly distributed across linguistic phenomena but is concentrated in cases requiring the integration of contextual information across long spans — precisely the representational capacity that self-attention provides and that sequential compression in LSTM architectures forfeits. The identification of specific error categories where the performance gap narrows or disappears provides a principled basis for future research, moving beyond the global accuracy comparison that characterises most published transformer-versus-LSTM studies and toward a mechanistic understanding of when and why each architecture family succeeds or fails on Polish-language product reviews [23][24][26].
Conclusion
The present thesis has undertaken a systematic comparative evaluation of Transformer-based and LSTM-based neural architectures applied to the task of three-class sentiment classification of Polish-language product reviews. The investigation was structured around three principal objectives: the establishment of a rigorous theoretical framework situating both model families within the broader context of sentiment analysis and neural language modelling; the construction and documentation of a reproducible experimental methodology grounded in established best practices for evaluating classifiers under class imbalance and multiple comparison conditions; and the production of empirical results capable of supporting statistically defensible conclusions about the relative merits of the two architectural paradigms in the specific linguistic and domain context under examination. Each of these objectives has been addressed in the preceding chapters, and the synthesis of findings presented in this conclusion is intended to consolidate the principal contributions of the study, assess the degree to which the original research hypothesis has been confirmed, and delineate the directions in which the present work may most productively be extended.
The theoretical exposition presented in Chapter 1 established the conceptual distinctions between LSTM-based and Transformer-based approaches that were subsequently operationalised in the experimental evaluation. The bidirectional LSTM architecture, in its stacked and dropout-regularised configuration, was shown to represent a mature and well-understood approach to sequential text modelling, capable of capturing local syntactic dependencies and benefiting substantially from the transfer of pre-trained fastText embeddings trained on large monolingual Polish corpora. However, the inherent sequential compression mechanism of recurrent architectures — whereby the hidden state at each time step encodes all preceding context in a fixed-dimensional vector — was identified as a structural constraint limiting the representational capacity available for integrating long-range contextual information. The Transformer architecture, and specifically the HerBERT and Polish BERT models selected for experimental evaluation, was shown to overcome this limitation through the application of multi-head self-attention across the full input sequence, enabling each token representation to be updated by direct weighted interactions with all other tokens in the input, irrespective of their distance. The pre-training regime on large Polish corpora using masked language modelling and next sentence prediction objectives was further identified as a critical factor enabling transformer models to acquire rich morphological and semantic representations appropriate to the complex inflectional structure of Polish, prior to any task-specific adaptation through fine-tuning.
Chapter 2 established the methodological foundations upon which the empirical comparison was built. The Allegro Reviews Dataset, comprising approximately 11,000 annotated product reviews drawn from the dominant Polish-language e-commerce platform, was identified as the primary evaluation corpus on the basis of its scale, domain relevance, and the linguistic characteristics it presents — including the frequent occurrence of colloquial Polish, product-specific jargon, abbreviated orthography, and mixed code-switching — which collectively constitute a demanding test of the ability of both architectural families to generalise beyond the formal register represented in pre-training data. The preprocessing pipeline, including language detection, star-rating-to-polarity mapping, tokenisation with architecture-appropriate vocabularies, and stratified train/validation/test partitioning, was designed to ensure that preprocessing decisions did not systematically advantage either model family. The evaluation framework, anchored by macro-averaged F1-score as the primary performance metric and supplemented by per-class precision and recall, macro-averaged AUC-ROC, bootstrap confidence intervals, and Holm–Bonferroni-corrected McNemar tests for pairwise significance assessment, was argued to provide a statistically principled basis for drawing conclusions about architectural superiority under the class imbalance and multiple comparison conditions characteristic of the experimental design. The cross-domain evaluation on the PolEmo 2.0 corpus was included as a partial empirical check on the external validity of results obtained on Allegro reviews.
The experimental results reported in Chapter 3 provided strong and consistent empirical support for the primary research hypothesis that Transformer-based models would demonstrate statistically significant superiority over LSTM-based baselines in sentiment classification of Polish product reviews. On the primary evaluation corpus, HerBERT achieved a macro-averaged F1-score of 89.7 percent, an improvement of 6.1 percentage points over the strongest LSTM configuration (BiLSTM-Stack, 83.6 percent) and of 15.4 percentage points over the LSTM baseline trained without pre-trained embeddings. Polish BERT achieved a macro-F1 of 88.1 percent, placing it between HerBERT and the best LSTM configuration. Pairwise McNemar tests, corrected for multiple comparisons using the Holm–Bonferroni step-down procedure, confirmed that the performance differences between each Transformer model and each LSTM configuration were statistically significant at a family-wise error rate of 0.05. These findings were replicated in the cross-domain evaluation on PolEmo 2.0, where HerBERT again achieved the strongest macro-F1 performance and the relative ordering of all five evaluated configurations was preserved. This replication provides additional evidence that the observed advantage of Transformer architectures is not an artefact specific to the Allegro domain or to the star-rating label assignment procedure employed in constructing the training corpus.
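The significance procedure summarised above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used in the thesis: the exact (binomial) form of the McNemar test is assumed because the text does not specify the variant, and the function names are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b: items classified correctly by model A only
    c: items classified correctly by model B only
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as extreme as k.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down procedure; returns reject decisions in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is tested at alpha/m, the next at alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are too
    return reject
```

Applied to the pairwise comparisons, each Transformer-versus-LSTM pair would contribute one p-value computed from its discordant test items, and `holm_bonferroni` would then control the family-wise error rate across the whole set at 0.05.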
The error analysis conducted in Section 3.5 refined these aggregate findings by identifying the specific linguistic contexts in which the performance gap between Transformer and LSTM architectures was largest, narrowest, and — in a small number of cases — reversed. The superiority of HerBERT and Polish BERT was found to be most pronounced in error categories involving negation scope ambiguity, long-range contextual dependencies, and morphologically complex affective modifiers: precisely the linguistic phenomena that bidirectional contextual representations equipped with full self-attention are architecturally suited to handle. Conversely, both architectural families were found to exhibit similar and relatively high error rates on reviews containing domain-specific technical jargon and non-standard orthographic forms, including abbreviations and colloquial spelling variants. The identification of a small set of test instances misclassified by HerBERT but correctly classified by the BiLSTM-Stack configuration — disproportionately drawn from the technical jargon category — was interpreted as evidence that the broader pre-training data of fastText embeddings, which includes substantial technical web text, may provide marginally better coverage of specialised vocabulary in specific subcases, notwithstanding the general superiority of the Transformer's contextual representations across the broader test distribution.
A finding of both theoretical and practical significance concerns the cost-performance trade-off between the two architectural families. HerBERT, with approximately 125 million trainable parameters and a fine-tuning requirement of multiple GPU-hours, achieves its performance advantage at a substantially greater computational cost than the BiLSTM-Stack configuration, which has an order of magnitude fewer parameters and trains to convergence in a fraction of the wall-clock time. In deployment scenarios characterised by tight latency constraints, limited GPU availability, or strict energy budgets, conditions representative of many production environments in e-commerce applications operating at scale, the computational overhead of Transformer inference may constitute a practical barrier to deployment. The BiLSTM-Stack macro-F1 of 83.6 percent is statistically significantly lower than that of HerBERT but nonetheless substantially above the random baseline, which suggests that the LSTM-based approach retains practical relevance in precisely those resource-constrained settings where the full cost of Transformer fine-tuning and inference cannot be accommodated. The appropriate architectural choice for any given application is therefore a function not only of the achievable classification performance but also of the computational budget, latency requirements, and maintenance infrastructure available to the deploying organisation.
The findings of the present study motivate several concrete and empirically grounded directions for future research. The first and most immediately promising extension concerns the application of aspect-level sentiment analysis to Polish product reviews. The document-level polarity classification framework adopted in the present study treats each review as expressing a single, global sentiment, whereas many real-world reviews express differentiated opinions about distinct product attributes — price, build quality, customer service, delivery speed — that may carry opposing polarities within a single document. Aspect-level sentiment analysis, which requires the joint identification of opinion targets and the sentiment orientations expressed toward each target, represents a substantially more challenging and informationally richer task than document-level classification. The contextual representations produced by HerBERT and Polish BERT, which encode token-level semantic roles within a globally-contextualised representational space, are in principle well suited to span-extraction approaches to aspect identification, and the extension of the present evaluation framework to aspect-level tasks on Polish e-commerce corpora would constitute a natural and valuable continuation of this work.
A second direction concerns the systematic investigation of cross-domain transfer to Polish review genres beyond the product review domain examined in this study. The partial cross-domain evaluation conducted on PolEmo 2.0, which combines reviews from the medicine, hotel, and product domains, provided preliminary evidence that the relative performance advantage of Transformer over LSTM architectures is preserved across domains, but did not address the absolute degradation in performance that both model families exhibited when evaluated on out-of-domain data. A more comprehensive investigation of domain generalisation, employing corpora drawn from distinct review platforms, linguistic registers, and product categories, would clarify the conditions under which models trained on Allegro product reviews can be reliably applied to related Polish-language sentiment classification tasks without domain-specific retraining. Such an investigation would have direct practical implications for organisations seeking to leverage sentiment classification models trained on publicly available corpora for applications in domains where labelled training data are scarce.
A third and particularly timely direction for future research concerns the application of parameter-efficient fine-tuning methods — including Low-Rank Adaptation (LoRA), adapter modules, and prefix tuning — to reduce the computational burden associated with Transformer fine-tuning for Polish sentiment classification. These methods, which introduce a small number of additional trainable parameters while keeping the pre-trained model weights frozen or minimally updated, have been shown in recent research to achieve performance competitive with full fine-tuning at a fraction of the training cost and with substantially reduced memory requirements [29]. The application of such methods to HerBERT and Polish BERT in the context of Polish sentiment analysis would directly address the cost-performance trade-off identified as a limitation of the present study, and could potentially enable the deployment of Transformer-quality sentiment classifiers in the resource-constrained environments where LSTM-based approaches currently retain a practical advantage. The combination of parameter-efficient fine-tuning with continued pre-training on Polish e-commerce corpora — to address the shared weakness of both architectural families in handling domain-specific technical jargon — represents a particularly well-motivated line of investigation building directly upon the error analysis findings reported in Chapter 3.
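The parameter saving offered by Low-Rank Adaptation can be made concrete with a small sketch. It is a dependency-free illustration of the general technique rather than an implementation for HerBERT or Polish BERT: the class name, the rank, the scaling constant `alpha`, and the initialisation scale are assumptions chosen for the example.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=16.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pre-trained weights, kept frozen
        # A starts with small random values and B with zeros, so the
        # adapted layer initially reproduces the pre-trained layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        """Only A and B are trained: rank * (d_in + d_out) values."""
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)
```

For a square projection of width 768 (an assumed hidden size typical of BERT-base models), a rank-8 adapter trains 8 × (768 + 768) = 12,288 values in place of the 589,824 entries of the full matrix, roughly 2 percent of the original count.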
In summary, the experimental evaluation conducted in this thesis has established, with statistical rigour, that Transformer-based models pre-trained on large Polish corpora, and HerBERT in particular, achieve substantially and significantly superior performance relative to LSTM-based architectures on the task of three-class sentiment classification of Polish-language product reviews. This finding holds across both the primary Allegro Reviews evaluation corpus and the cross-domain PolEmo 2.0 benchmark, and the advantage is concentrated in precisely the linguistic phenomena (negation scope, long-range contextual dependencies, and morphological complexity) that self-attention mechanisms are theoretically well positioned to address. The practical significance of this superiority is qualified by the substantially greater computational cost of Transformer fine-tuning and inference, which sustains a role for LSTM-based approaches in resource-constrained deployment contexts. The theoretical contributions of Chapter 1, the methodological framework of Chapter 2, and the empirical results and error analysis of Chapter 3 together constitute a cohesive and reproducible empirical study that advances the understanding of neural sentiment analysis for morphologically complex languages. The study also provides a principled foundation for the future research directions identified as the most productive avenues for extending the present work: aspect-level analysis, cross-domain transfer, and parameter-efficient fine-tuning.