Which open-source LLMs best suit text classification tasks?

Insights from the top 10 papers

Open-Source LLMs for Text Classification

Overview of Text Classification Tasks

  • Text classification is a fundamental NLP task that involves assigning a category or label to a given text input
  • Common text classification tasks include:
    • Sentiment analysis (e.g. positive/negative)
    • Topic classification (e.g. politics, sports, technology)
    • Intent classification (e.g. question answering, customer service requests)
    • Named entity recognition, a token-level classification task (e.g. identifying people, locations, organizations)

Comparison of Open-Source vs Closed-Source LLMs

Open-Source LLMs

  • Examples: LLaMA, BLOOM
  • Advantages:
    • Transparent model architecture and training data
    • Accessible for research and development
    • Potential for community-driven improvements
  • Challenges:
    • May not match the performance of closed-source models like GPT-3.5 and GPT-4
    • Require careful fine-tuning and prompting to achieve strong results on classification tasks

Closed-Source LLMs

  • Examples: GPT-3.5, GPT-4
  • Advantages:
    • Often demonstrate state-of-the-art performance on a variety of NLP tasks, including text classification
    • Benefit from extensive training data and computational resources
  • Challenges:
    • Lack of transparency in model architecture and training data
    • Restricted access and high costs, limiting research and development
    • Potential ethical and privacy concerns due to their black-box nature

Strategies for Using Open-Source LLMs for Text Classification

Fine-Tuning Techniques

  • Fine-tuning open-source LLMs such as LLaMA and BLOOM on task-specific classification datasets can help them rival the performance of closed-source models
  • Parameter-efficient fine-tuning (PEFT) methods such as Low-Rank Adaptation (LoRA) cut the cost of fine-tuning by updating only a small fraction of the model's weights (see the sketch after this list)
  • Smaller, supervised models like RoBERTa can also achieve strong performance on many text classification tasks, sometimes outperforming larger generative LLMs
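
As a concrete illustration, the sketch below fine-tunes an open-source encoder with LoRA adapters. It assumes the Hugging Face transformers, peft, and datasets libraries; the checkpoint, hyperparameters, and the IMDb dataset are illustrative choices, not prescriptions from the cited papers.

```python
# Minimal LoRA fine-tuning sketch for binary text classification.
# Assumes: pip install transformers peft datasets
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-base"  # any open checkpoint with a classification head
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)

# LoRA freezes the base model and trains small low-rank adapter matrices.
peft_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=8,
                         lora_alpha=16, lora_dropout=0.1)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # typically well under 1% of all weights

dataset = load_dataset("imdb")  # binary sentiment labels

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-clf",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # default collator then pads each batch dynamically
)
trainer.train()
```

Because only the adapters (plus the classification head) are trained, the same recipe extends to much larger decoder models such as LLaMA on modest hardware.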

Prompting Strategies

  • Carefully crafting prompts can help open-source LLMs perform better on text classification tasks
  • Techniques like in-context learning, where the model is provided with a few examples of the target task in the prompt, can boost performance
  • Leveraging the semantic content of class labels and choosing in-context examples similar to the current input can further improve in-context learning (see the sketch after this list)
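
To make this concrete, here is a minimal sketch of assembling a few-shot classification prompt. The commented-out generate call is a hypothetical placeholder for whatever inference backend serves your open-source model (for example, a transformers text-generation pipeline); the label names and demonstrations are illustrative.

```python
# Minimal few-shot (in-context learning) prompt for sentiment classification.
LABELS = ["positive", "negative"]  # descriptive label names carry semantics

# In practice, demonstrations are often retrieved by embedding similarity
# to the input text rather than fixed, which tends to boost accuracy.
DEMONSTRATIONS = [
    ("The plot was gripping from start to finish.", "positive"),
    ("A dull, forgettable film.", "negative"),
]

def build_prompt(text: str) -> str:
    parts = [f"Classify each review as {' or '.join(LABELS)}.", ""]
    for demo_text, demo_label in DEMONSTRATIONS:
        parts.append(f"Review: {demo_text}\nLabel: {demo_label}\n")
    parts.append(f"Review: {text}\nLabel:")
    return "\n".join(parts)

prompt = build_prompt("I would happily watch it again.")
print(prompt)
# label = generate(prompt).strip()  # hypothetical model call; expect "positive"
```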

Open-Source Toolkits for Text Classification

NeuralClassifier

  • An open-source toolkit for neural hierarchical multi-label text classification
  • Supports a variety of text encoders, including Transformer-based models, and handles binary, multi-class, and multi-label classification (a generic multi-label sketch follows)
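
NeuralClassifier itself is configured through its own JSON files, so the snippet below is not its API; it is a generic sketch of the multi-label setup such toolkits support, written against the Hugging Face transformers library with an illustrative checkpoint and label set.

```python
# Generic multi-label classification sketch (NOT NeuralClassifier's API).
# Assumes: pip install torch transformers
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=3,  # e.g. politics, sports, technology
    problem_type="multi_label_classification",  # sigmoid per label + BCE loss
)

inputs = tokenizer("Lawmakers debate funding for the new stadium.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Unlike multi-class softmax, each label gets an independent probability,
# so zero, one, or several labels can apply to the same text.
probs = torch.sigmoid(logits)
print((probs > 0.5).squeeze(0).tolist())
```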

C-NARS/P

  • An open-source tool for classifying narratives in survey data using pre-trained language models
  • Can be easily adapted to a wide range of text classification tasks

Source Papers (10)

PyTAIL: An Open Source Tool for Interactive and Incremental Learning of NLP Models with Human in the Loop for Online Data
Open, Closed, or Small Language Models for Text Classification?
LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
Universal Language Model Fine-tuning for Text Classification
NeuralClassifier: An Open-source Neural Hierarchical Multi-label Text Classification Toolkit
LLM-AIx: An open source pipeline for Information Extraction from unstructured medical text based on privacy preserving Large Language Models
Improving Medical Abstract Classification Using PEFT-LoRA Fine-Tuned Large and Small Language Models
Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs
C-NARS/P: An Open-Source Tool for Classification of Narratives in Survey Data Using Pre-Trained Language Models
In-Context Learning for Text Classification with Many Labels