Which open source LLM models best suit text classification tasks?
Insight from top 10 papers
Open Source LLM Models for Text Classification
Overview of Text Classification Tasks
- Text classification is a fundamental NLP task that involves assigning a category or label to a given text input
- Common text classification tasks include:
  - Sentiment analysis (e.g., positive/negative)
  - Topic classification (e.g., politics, sports, technology)
  - Intent classification (e.g., question answering, customer service requests)
  - Named entity recognition (a token-level classification task: identifying people, locations, organizations)
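At its core, every task above reduces to the same step: a classifier head (e.g., on top of an open-source LLM) emits one raw score per label, and softmax-plus-argmax turns those scores into a prediction. A minimal pure-Python sketch, with hypothetical sentiment logits:

```python
import math

def classify(logits: dict) -> tuple:
    """Pick the most likely label from raw per-class scores (logits).

    Softmax turns scores into probabilities; argmax selects the label.
    """
    # Shift by the max logit for numerical stability before exponentiating.
    m = max(logits.values())
    exp = {label: math.exp(score - m) for label, score in logits.items()}
    total = sum(exp.values())
    probs = {label: v / total for label, v in exp.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]

# Hypothetical logits for the input "Great product, fast shipping!"
label, prob = classify({"positive": 3.1, "negative": -1.2})
print(label, round(prob, 3))
```

The same function covers binary, multi-class sentiment, or topic labels; only the set of keys in the logits dictionary changes.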
Comparison of Open-Source vs Closed-Source LLMs
Open-Source LLMs
- Examples: LLaMA, BLOOM
- Advantages:
  - Transparent model architecture and training data
  - Accessible for research and development
  - Potential for community-driven improvements
- Challenges:
  - May not match the performance of closed-source models such as GPT-3.5 and GPT-4
  - Require careful fine-tuning and prompting to achieve strong results on classification tasks
Closed-Source LLMs
- Examples: GPT-3.5, GPT-4
- Advantages:
  - Often demonstrate state-of-the-art performance on a variety of NLP tasks, including text classification
  - Benefit from extensive training data and computational resources
- Challenges:
  - Lack of transparency in model architecture and training data
  - Restricted access and high costs, limiting research and development
  - Potential ethical and privacy concerns due to their black-box nature
Strategies for Using Open-Source LLMs for Text Classification
Fine-Tuning Techniques
- Fine-tuning open-source LLMs like LLaMA and Bloom on specific text classification datasets can help them rival the performance of closed-source models
- Parameter-efficient fine-tuning (PEFT) techniques such as Low-Rank Adaptation (LoRA) reduce the cost of fine-tuning by training only a small fraction of the model's parameters
- Smaller, supervised models like RoBERTa can also achieve strong performance on many text classification tasks, sometimes outperforming larger generative LLMs
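LoRA's efficiency comes from freezing a pretrained weight matrix W (d × k) and learning only a low-rank update W + BA, with B (d × r) and A (r × k) for a small rank r. A back-of-the-envelope sketch of the savings, using a hypothetical LLaMA-scale projection layer (d = k = 4096 is an assumption for illustration, not a claim about any specific checkpoint):

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple:
    """Trainable parameters: full fine-tuning of a d x k weight matrix
    vs. LoRA, which learns only B (d x r) and A (r x k)."""
    full = d * k          # every entry of W is trainable
    lora = d * r + r * k  # only the two low-rank factors are trainable
    return full, lora

# Hypothetical 4096 x 4096 attention projection, LoRA rank r = 8.
full, lora = lora_param_counts(4096, 4096, r=8)
print(full, lora, f"{lora / full:.4%}")
```

With r = 8 the adapter trains well under 1% of the layer's parameters, which is why LoRA makes fine-tuning models like LLaMA and BLOOM feasible on modest hardware.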
Prompting Strategies
- Carefully crafting prompts can help open-source LLMs perform better on text classification tasks
- Techniques like in-context learning, where the model is provided with a few examples of the target task in the prompt, can boost performance
- Leveraging the semantic content of class labels and the similarity of in-context examples to the current input can further improve in-context learning
Open-Source Toolkits for Text Classification
NeuralClassifier
- An open-source toolkit for neural hierarchical multi-label text classification
- Supports a variety of text encoders, including Transformer-based models, and can be used for binary, multi-class, and multi-label classification tasks
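The distinction between the supported task types comes down to the decision rule applied to the encoder's output scores: multi-class picks exactly one label via argmax, while multi-label decides each label independently with a sigmoid and a threshold. A minimal pure-Python illustration (not NeuralClassifier's actual code; the scores are hypothetical):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def multiclass(logits: dict) -> str:
    """Multi-class: exactly one label, via argmax over the scores."""
    return max(logits, key=logits.get)

def multilabel(logits: dict, threshold=0.5) -> list:
    """Multi-label: each label kept independently if sigmoid(score)
    clears the threshold, so zero, one, or many labels can fire."""
    return sorted(l for l, s in logits.items() if sigmoid(s) >= threshold)

scores = {"politics": 1.2, "sports": -2.0, "economy": 0.8}
print(multiclass(scores))
print(multilabel(scores))
```

Binary classification is the two-label special case of the multi-class rule.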
C-NARS/P
- An open-source tool for classifying narratives in survey data using pre-trained language models
- Can be readily adapted to a wide range of text classification tasks
Source Papers (10)
PyTAIL: An Open Source Tool for Interactive and Incremental Learning of NLP Models with Human in the Loop for Online Data
Open, Closed, or Small Language Models for Text Classification?
LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
Universal Language Model Fine-tuning for Text Classification
NeuralClassifier: An Open-source Neural Hierarchical Multi-label Text Classification Toolkit
LLM-AIx: An open source pipeline for Information Extraction from unstructured medical text based on privacy preserving Large Language Models
Improving Medical Abstract Classification Using PEFT-LoRA Fine-Tuned Large and Small Language Models
Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs
C-NARS/P: An Open-Source Tool for Classification of Narratives in Survey Data Using Pre-Trained Language Models
In-Context Learning for Text Classification with Many Labels