Which open-source LLMs best suit text classification tasks?

Insights from the top 10 papers

Open-Source LLMs for Text Classification

Overview of Text Classification Tasks

  • Text classification is a fundamental NLP task that involves assigning a category or label to a given text input
  • Common text classification tasks include:
    • Sentiment analysis (e.g. positive/negative)
    • Topic classification (e.g. politics, sports, technology)
    • Intent classification (e.g. question answering, customer service requests)
    • Named entity recognition, a token-level classification task (e.g. identifying people, locations, organizations)

Comparison of Open-Source vs Closed-Source LLMs

Open-Source LLMs

  • Examples: LLaMA, BLOOM
  • Advantages:
    • Transparent model architecture and training data
    • Accessible for research and development
    • Potential for community-driven improvements
  • Challenges:
    • May not match the performance of closed-source models like GPT-3.5 and GPT-4
    • Require careful fine-tuning and prompting to achieve strong results on classification tasks

Closed-Source LLMs

  • Examples: GPT-3.5, GPT-4
  • Advantages:
    • Often demonstrate state-of-the-art performance on a variety of NLP tasks, including text classification
    • Benefit from extensive training data and computational resources
  • Challenges:
    • Lack of transparency in model architecture and training data
    • Restricted access and high costs, limiting research and development
    • Potential ethical and privacy concerns due to their black-box nature

Strategies for Using Open-Source LLMs for Text Classification

Fine-Tuning Techniques

  • Fine-tuning open-source LLMs such as LLaMA and BLOOM on task-specific classification datasets can help them rival the performance of closed-source models
  • Parameter-efficient fine-tuning (PEFT) methods such as Low-Rank Adaptation (LoRA) cut the cost of fine-tuning by updating only a small fraction of the model's weights (see the sketch after this list)
  • Smaller, supervised models like RoBERTa can also achieve strong performance on many text classification tasks, sometimes outperforming larger generative LLMs
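
As a concrete illustration, the sketch below fine-tunes an open-source encoder with LoRA adapters. It assumes the Hugging Face transformers, peft, and datasets libraries; the checkpoint, hyperparameters, and the IMDb dataset are illustrative choices, not prescriptions from the cited papers.

```python
# Minimal LoRA fine-tuning sketch for binary text classification.
# Assumes: pip install transformers peft datasets
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-base"  # any open checkpoint with a classification head
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)

# LoRA freezes the base model and trains small low-rank adapter matrices.
peft_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=8,
                         lora_alpha=16, lora_dropout=0.1)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # typically well under 1% of all weights

dataset = load_dataset("imdb")  # binary sentiment labels

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-clf",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # default collator then pads each batch dynamically
)
trainer.train()
```

Because only the adapters (plus the classification head) are trained, the same recipe extends to much larger decoder models such as LLaMA on modest hardware.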

Prompting Strategies

  • Carefully crafting prompts can help open-source LLMs perform better on text classification tasks
  • Techniques like in-context learning, where the model is provided with a few examples of the target task in the prompt, can boost performance
  • Leveraging the semantic content of class labels and choosing in-context examples similar to the current input can further improve in-context learning (see the sketch after this list)
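
To make this concrete, here is a minimal sketch of assembling a few-shot classification prompt. The commented-out generate call is a hypothetical placeholder for whatever inference backend serves your open-source model (for example, a transformers text-generation pipeline); the label names and demonstrations are illustrative.

```python
# Minimal few-shot (in-context learning) prompt for sentiment classification.
LABELS = ["positive", "negative"]  # descriptive label names carry semantics

# In practice, demonstrations are often retrieved by embedding similarity
# to the input text rather than fixed, which tends to boost accuracy.
DEMONSTRATIONS = [
    ("The plot was gripping from start to finish.", "positive"),
    ("A dull, forgettable film.", "negative"),
]

def build_prompt(text: str) -> str:
    parts = [f"Classify each review as {' or '.join(LABELS)}.", ""]
    for demo_text, demo_label in DEMONSTRATIONS:
        parts.append(f"Review: {demo_text}\nLabel: {demo_label}\n")
    parts.append(f"Review: {text}\nLabel:")
    return "\n".join(parts)

prompt = build_prompt("I would happily watch it again.")
print(prompt)
# label = generate(prompt).strip()  # hypothetical model call; expect "positive"
```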

Open-Source Toolkits for Text Classification

NeuralClassifier

  • An open-source toolkit for neural hierarchical multi-label text classification
  • Supports a variety of text encoders, including Transformer-based models, and handles binary, multi-class, and multi-label classification (a generic multi-label sketch follows)
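
NeuralClassifier itself is configured through its own JSON files, so the snippet below is not its API; it is a generic sketch of the multi-label setup such toolkits support, written against the Hugging Face transformers library with an illustrative checkpoint and label set.

```python
# Generic multi-label classification sketch (NOT NeuralClassifier's API).
# Assumes: pip install torch transformers
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=3,  # e.g. politics, sports, technology
    problem_type="multi_label_classification",  # sigmoid per label + BCE loss
)

inputs = tokenizer("Lawmakers debate funding for the new stadium.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Unlike multi-class softmax, each label gets an independent probability,
# so zero, one, or several labels can apply to the same text.
probs = torch.sigmoid(logits)
print((probs > 0.5).squeeze(0).tolist())
```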

C-NARS/P

  • An open-source tool for classifying narratives in survey data using pre-trained language models
  • Can be easily adapted to a wide range of text classification tasks

Source Papers (10)

PyTAIL: An Open Source Tool for Interactive and Incremental Learning of NLP Models with Human in the Loop for Online Data
Open, Closed, or Small Language Models for Text Classification?
LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
Universal Language Model Fine-tuning for Text Classification
NeuralClassifier: An Open-source Neural Hierarchical Multi-label Text Classification Toolkit
LLM-AIx: An open source pipeline for Information Extraction from unstructured medical text based on privacy preserving Large Language Models
Improving Medical Abstract Classification Using PEFT-LoRA Fine-Tuned Large and Small Language Models
Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs
C-NARS/P: An Open-Source Tool for Classification of Narratives in Survey Data Using Pre-Trained Language Models
In-Context Learning for Text Classification with Many Labels