Application of deep learning techniques to automatic classification: CNAE as a case study
This article presents the latest developments of the CIDMEFEO project (Data Science and Engineering for the Improvement of Official Statistics) in the field of automatic coding. Our goal is to develop a deep learning-based automatic coder for multiclass classification. The work focuses on the Spanish National Classification of Economic Activities (CNAE), which organizes the economic activities of companies into a four-level hierarchy comprising 664 final categories. CNAE was chosen as a case study because of both its relevance and its complexity. Our proposal combines deep learning techniques and LLMs to generate synthetic training data from the category descriptions. We then employ BERT-type models for classification, including multiclass and hierarchical approaches. Our results indicate that classification performance improves by up to 15% when the generated data is used, while efficiency increases (by reducing computational costs compared with using LLMs) and so does security (by reducing the use of real data).
Keywords: deep learning, machine learning, text classification, transformers, CNAE