title: LegalKit Retrieval
emoji: 📖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.25.0
app_file: app.py
pinned: true
license: apache-2.0
short_description: A binary Search with Scalar Rescoring through legal codes

LegalKit Retrieval, a binary Search with Scalar (int8) Rescoring through French legal codes

This Space showcases the tsdae-lemone-mbert-base model by Louis Brulé Naudet, a BERT-based sentence embedding model trained with the Transformer-based Sequential Denoising Auto-Encoder (TSDAE) method for unsupervised sentence embedding learning, with one objective: adaptation to the French legal domain.

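Retrieval runs in two stages: the binary index is searched for candidate documents, and those candidates are then rescored with their scalar (int8) embeddings. This process is designed to be fast and memory efficient: the binary index is small enough to fit in memory, and the int8 index is loaded as a view so that it stays on disk. In total, it requires keeping 1) the model in memory, 2) the binary index in memory, and 3) the int8 index on disk.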
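The binary search and int8 rescoring described above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the code of this Space: the function names (`quantize_binary`, `quantize_int8`, `search`), the single shared int8 scale, the `oversampling` parameter, and the random vectors standing in for model embeddings are all hypothetical. Both indexes are held in memory here for simplicity, whereas the Space keeps the int8 index on disk.

```python
import numpy as np


def quantize_binary(embeddings: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension and pack 8 dimensions per byte."""
    return np.packbits(embeddings > 0, axis=-1)


def quantize_int8(embeddings: np.ndarray, scale: float) -> np.ndarray:
    """Map floats to int8 using one shared scale (a simplifying assumption)."""
    return np.clip(np.round(embeddings * scale), -127, 127).astype(np.int8)


def search(query, binary_index, int8_index, top_k=5, oversampling=4):
    """Return (ids, scores) of the top_k documents for a float32 query."""
    # Stage 1: Hamming distance between the binarized query and every document.
    query_bits = quantize_binary(query[None, :])
    xor = np.bitwise_xor(binary_index, query_bits)
    hamming = np.unpackbits(xor, axis=-1).sum(axis=-1)
    n_candidates = min(top_k * oversampling, len(binary_index))
    candidates = np.argsort(hamming, kind="stable")[:n_candidates]

    # Stage 2: rescore only the candidates, using their int8 embeddings
    # against the full-precision query.
    scores = int8_index[candidates].astype(np.float32) @ query
    order = np.argsort(-scores, kind="stable")[:top_k]
    return candidates[order], scores[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random vectors stand in for document embeddings (hypothetical data).
    docs = rng.standard_normal((200, 64)).astype(np.float32)
    query = docs[42] + 0.1 * rng.standard_normal(64).astype(np.float32)

    binary_index = quantize_binary(docs)
    int8_index = quantize_int8(docs, scale=127 / np.abs(docs).max())
    ids, scores = search(query, binary_index, int8_index)
    print(ids, scores)
```

Only the candidates that survive the first stage are read from the int8 index, which is why that index can stay on disk while the much smaller binary index lives in memory.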

Additionally, the binary index is much faster (up to 32x) to search than the float32 index, while the rescoring is also extremely efficient. In conclusion, this process allows for fast, scalable, cheap, and memory-efficient retrieval.
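To put the memory side of this in numbers, here is a rough back-of-the-envelope sketch of raw embedding storage per precision. The corpus size is hypothetical, and the 768 dimensions are an assumption based on the BERT-base architecture. The helper `index_size_bytes` is illustrative and ignores any overhead added by an index structure.

```python
def index_size_bytes(n_vectors: int, dim: int, bits_per_dim: int) -> int:
    """Raw storage for n_vectors embeddings, ignoring index overhead."""
    return n_vectors * dim * bits_per_dim // 8


N_VECTORS = 1_000_000  # hypothetical corpus size
DIM = 768              # assumed embedding dimension (BERT-base hidden size)

BITS = {"float32": 32, "int8": 8, "binary": 1}
sizes = {name: index_size_bytes(N_VECTORS, DIM, bits) for name, bits in BITS.items()}

for name, size in sizes.items():
    ratio = sizes["float32"] // size
    print(f"{name:>8}: {size / 1e6:8.1f} MB ({ratio}x smaller than float32)")
```

Under these assumptions the binary embeddings take 32 times less space than float32 and the int8 embeddings 4 times less. This computes storage only; search speed depends on the hardware and the index implementation.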

Notes:

  • The SentenceTransformer model currently in use is in beta and may not be suitable for direct use in production.

Dependencies

Libraries Used:

  • Accelerate (v0.29.1): A Hugging Face library for running the same PyTorch code across different hardware setups (CPU, single or multiple GPUs, mixed precision) with minimal changes.
  • Faiss-GPU (v1.7.2): A GPU-accelerated library for efficient similarity search and clustering of dense vectors, essential for high-dimensional data analysis.
  • Gradio (v4.25.0): An intuitive library for creating customizable UI components around machine learning models, simplifying model deployment and interaction.
  • Polars (v0.20.18): A fast DataFrame library written in Rust and used here through its Python bindings, providing efficient data manipulation for large datasets.
  • Sentence-Transformers (v2.6.1): A versatile library for generating sentence embeddings, facilitating various natural language processing tasks such as semantic similarity and text classification.
  • Spaces (v0.25.0): The Hugging Face Spaces helper package, used to request GPU resources for the functions that need them when the app runs on shared GPU hardware.
  • Usearch (v2.10.5): A powerful library for performing fast approximate nearest neighbor search, crucial for tasks like recommendation systems and data clustering.

Installation Guide

To install all the dependencies, you can use the following command:

pip3 install accelerate faiss-gpu gradio polars sentence-transformers spaces usearch

Note: Ensure you have Python installed on your system before proceeding with the installation of these libraries.

Citing this project

If you use this code in your research, please use the following BibTeX entry.

@misc{louisbrulenaudet2024,
    author = {Louis Brulé Naudet},
    title = {LegalKit Retrieval, a binary Search with Scalar (int8) Rescoring through French legal codes},
    howpublished = {\url{https://huggingface.co/spaces/louisbrulenaudet/legalkit-retrieval}},
    year = {2024}
}

Feedback

If you have any feedback, please reach out at louisbrulenaudet@icloud.com.